pythainlp.tokenize

The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into the desired units.

Modules

pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'whitespace+newline') → List[str]

This function does not yet detect actual sentence boundaries. Rather, it splits the text wherever whitespace or a newline is found.

Parameters
  • text (str) – the text to be tokenized

  • engine (str) – choose between ‘whitespace’ and ‘whitespace+newline’

Returns

list of sentences
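
Example (a minimal sketch; the sample sentence is illustrative, and the split follows directly from breaking at whitespace):
>>> from pythainlp.tokenize import sent_tokenize
>>> text = "ฉันรักภาษาไทย เพราะฉันเป็นคนไทย"
>>> sent_tokenize(text, engine="whitespace")
['ฉันรักภาษาไทย', 'เพราะฉันเป็นคนไทย']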

pythainlp.tokenize.word_tokenize(text: str, custom_dict: Optional[marisa_trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) → List[str]
Parameters
  • text (str) – text to be tokenized

  • engine (str) – tokenizer to be used

  • custom_dict (marisa_trie.Trie) – a dictionary trie, as built by dict_trie()

  • keep_whitespace (bool) – True to keep whitespace characters, which commonly mark the end of a phrase in Thai

Returns

list of words

Options for engine
  • newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster

  • longest - dictionary-based, Longest Matching

  • deepcut - wrapper for deepcut, a deep learning-based tokenizer https://github.com/rkcosmos/deepcut

  • icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based

  • ulmfit - tokenizer for use with thai2fit (ULMFiT) models

  • a custom_dict can be provided for newmm, longest, and deepcut (see the second example below)

Example
>>> from pythainlp.tokenize import word_tokenize
>>> text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"
>>> word_tokenize(text, engine="newmm")
['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
>>> word_tokenize(text, engine="icu")
['โอ', 'เค', 'บ่', 'พวก', 'เรา', 'รัก', 'ภาษา', 'บ้าน', 'เกิด']
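
A second, hedged sketch showing the custom_dict parameter: the word list below is illustrative, and because the trie contains exactly these entries, maximal matching reproduces them.
>>> from pythainlp.tokenize import dict_trie, word_tokenize
>>> text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"
>>> trie = dict_trie(dict_source=["โอเค", "บ่", "พวกเรา", "รัก", "ภาษา", "บ้านเกิด"])
>>> word_tokenize(text, custom_dict=trie, engine="newmm")
['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
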
pythainlp.tokenize.syllable_tokenize(text: str) → List[str]
Parameters

text (str) – input string to be tokenized

Returns

list of syllables
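
Example (illustrative; exact syllable boundaries depend on the bundled syllable dictionary):
>>> from pythainlp.tokenize import syllable_tokenize
>>> syllable_tokenize("ภาษาไทย")
['ภา', 'ษา', 'ไทย']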

pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc') → List[str]
Parameters
  • text (str) – text to be tokenized

  • engine (str) – subword tokenizer

Returns

list of subwords

Options for engine
  • tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)

  • etcc - Enhanced Thai Character Cluster (Inrut et al. 2001) [In development]
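
Example (illustrative; the clusters shown are what the TCC rules are expected to produce for this string):
>>> from pythainlp.tokenize import subword_tokenize
>>> subword_tokenize("ประเทศไทย", engine="tcc")
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']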

pythainlp.tokenize.dict_trie(dict_source: Union[str, Iterable[str], marisa_trie.Trie]) → marisa_trie.Trie

Create a dictionary trie to be used with the word_tokenize() function. For more information on the trie data structure, see: https://marisa-trie.readthedocs.io/en/latest/index.html

Parameters

dict_source (string/list) – a list of vocabulary words, or a path to a source file

Returns

a trie created from a dictionary input
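
Example (a minimal sketch; the word list is illustrative, and the membership check assumes marisa_trie.Trie's usual semantics):
>>> from pythainlp.tokenize import dict_trie
>>> trie = dict_trie(dict_source=["ภาษา", "ไทย", "บ้านเกิด"])
>>> "ไทย" in trie
True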

class pythainlp.tokenize.Tokenizer(custom_dict: Optional[Union[marisa_trie.Trie, Iterable[str], str]] = None, engine: str = 'newmm')
set_tokenize_engine(engine: str) → None
Parameters

engine (str) – tokenization engine to use (newmm, mm, or longest)

word_tokenize(text: str) → List[str]
Parameters

text (str) – text to be tokenized

Returns

list of words, tokenized from the text
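
Example (a sketch, assuming a small illustrative word list as the custom dictionary):
>>> from pythainlp.tokenize import Tokenizer
>>> tokenizer = Tokenizer(custom_dict=["ภาษา", "ไทย"], engine="newmm")
>>> tokenizer.word_tokenize("ภาษาไทย")
['ภาษา', 'ไทย']
>>> tokenizer.set_tokenize_engine("longest")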

NEWMM

pythainlp.tokenize.newmm.segment(text: str, custom_dict: Optional[marisa_trie.Trie] = None) → List[str]

Dictionary-based word segmentation, using the maximal matching algorithm and Thai Character Clusters.

Parameters
  • text (str) – text to be tokenized into words

  • custom_dict (marisa_trie.Trie) – a dictionary trie

Returns

list of words, tokenized from the text
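
Example (illustrative input; the segmentation assumes both words are in the default dictionary):
>>> from pythainlp.tokenize.newmm import segment
>>> segment("สวัสดีครับ")
['สวัสดี', 'ครับ']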

TCC

Thai Character Cluster

pythainlp.tokenize.tcc.segment(text: str) → List[str]

Subword segmentation.

Parameters

text (str) – text to be tokenized to character clusters

Returns

list of subwords (character clusters), tokenized from the text
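
Example (illustrative; the clusters follow the TCC rules):
>>> from pythainlp.tokenize import tcc
>>> tcc.segment("ประเทศไทย")
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']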

pythainlp.tokenize.tcc.tcc(text: str) → str

TCC generator: generates Thai Character Clusters one at a time.

Parameters

text (str) – text to be tokenized to character clusters

Returns

subword (character cluster)

pythainlp.tokenize.tcc.tcc_pos(text: str) → Set[int]

TCC positions.

Parameters

text (str) – text to be tokenized into character clusters

Returns

set of the end positions of subwords
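
Example (a hedged sketch: tcc() is described above as a generator, so it is materialized with list() here, and the positions shown assume each value is the exclusive end index of a cluster):
>>> from pythainlp.tokenize import tcc
>>> list(tcc.tcc("ประเทศไทย"))
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']
>>> tcc.tcc_pos("ประเทศไทย")
{1, 3, 5, 6, 8, 9}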