pythainlp.tokenize

The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into the desired units.

Modules

pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'whitespace+newline') → List[str]

This function does not yet detect actual sentence boundaries. Rather, it splits the text wherever whitespace or a newline is found.

Parameters
  • text (str) – the text to be tokenized

  • engine (str) – choose between ‘whitespace’ and ‘whitespace+newline’

Returns

list of sentences
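
Example (a minimal sketch; the sample sentence is illustrative, and the split follows directly from breaking at whitespace):
>>> from pythainlp.tokenize import sent_tokenize
>>> text = "ฉันรักภาษาไทย เพราะฉันเป็นคนไทย"
>>> sent_tokenize(text, engine="whitespace")
['ฉันรักภาษาไทย', 'เพราะฉันเป็นคนไทย']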

pythainlp.tokenize.word_tokenize(text: str, custom_dict: Optional[marisa_trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) → List[str]
Parameters
  • text (str) – text to be tokenized

  • engine (str) – tokenizer to be used

  • custom_dict (marisa_trie.Trie) – a dictionary trie, as built by dict_trie()

  • keep_whitespace (bool) – True to keep whitespace characters, which commonly mark the end of a phrase in Thai

Returns

list of words

Options for engine
  • newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster

  • longest - dictionary-based, Longest Matching

  • deepcut - wrapper for deepcut, a deep learning-based tokenizer https://github.com/rkcosmos/deepcut

  • icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based

  • ulmfit - tokenizer for use with thai2fit (ULMFiT) models

  • a custom_dict can be provided for newmm, longest, and deepcut (see the second example below)

Example
>>> from pythainlp.tokenize import word_tokenize
>>> text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"
>>> word_tokenize(text, engine="newmm")
['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
>>> word_tokenize(text, engine="icu")
['โอ', 'เค', 'บ่', 'พวก', 'เรา', 'รัก', 'ภาษา', 'บ้าน', 'เกิด']
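
A second, hedged sketch showing the custom_dict parameter: the word list below is illustrative, and because the trie contains exactly these entries, maximal matching reproduces them.
>>> from pythainlp.tokenize import dict_trie, word_tokenize
>>> text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"
>>> trie = dict_trie(dict_source=["โอเค", "บ่", "พวกเรา", "รัก", "ภาษา", "บ้านเกิด"])
>>> word_tokenize(text, custom_dict=trie, engine="newmm")
['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
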
pythainlp.tokenize.syllable_tokenize(text: str) → List[str]
Parameters

text (str) – input string to be tokenized

Returns

list of syllables
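
Example (illustrative; exact syllable boundaries depend on the bundled syllable dictionary):
>>> from pythainlp.tokenize import syllable_tokenize
>>> syllable_tokenize("ภาษาไทย")
['ภา', 'ษา', 'ไทย']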

pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc') → List[str]
Parameters
  • text (str) – text to be tokenized

  • engine (str) – subword tokenizer

Returns

list of subwords

Options for engine
  • tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)

  • etcc - Enhanced Thai Character Cluster (Inrut et al. 2001) [In development]
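
Example (illustrative; the clusters shown are what the TCC rules are expected to produce for this string):
>>> from pythainlp.tokenize import subword_tokenize
>>> subword_tokenize("ประเทศไทย", engine="tcc")
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']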

pythainlp.tokenize.dict_trie(dict_source: Union[str, Iterable[str], marisa_trie.Trie]) → marisa_trie.Trie

Create a dictionary trie to be used with the word_tokenize() function. For more information on the trie data structure, see: https://marisa-trie.readthedocs.io/en/latest/index.html

Parameters

dict_source (string/list) – a list of vocabulary words, or a path to a source file

Returns

a trie created from a dictionary input
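
Example (a minimal sketch; the word list is illustrative, and the membership check assumes marisa_trie.Trie's usual semantics):
>>> from pythainlp.tokenize import dict_trie
>>> trie = dict_trie(dict_source=["ภาษา", "ไทย", "บ้านเกิด"])
>>> "ไทย" in trie
True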

class pythainlp.tokenize.Tokenizer(custom_dict: Optional[Union[marisa_trie.Trie, Iterable[str], str]] = None, engine: str = 'newmm')
set_tokenize_engine(engine: str) → None
Parameters

engine (str) – tokenization engine to use (newmm, mm, or longest)

word_tokenize(text: str) → List[str]
Parameters

text (str) – text to be tokenized

Returns

list of words, tokenized from the text
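
Example (a sketch, assuming a small illustrative word list as the custom dictionary):
>>> from pythainlp.tokenize import Tokenizer
>>> tokenizer = Tokenizer(custom_dict=["ภาษา", "ไทย"], engine="newmm")
>>> tokenizer.word_tokenize("ภาษาไทย")
['ภาษา', 'ไทย']
>>> tokenizer.set_tokenize_engine("longest")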

NEWMM

pythainlp.tokenize.newmm.segment(text: str, custom_dict: Optional[marisa_trie.Trie] = None) → List[str]

Dictionary-based word segmentation, using the maximal matching algorithm and Thai Character Clusters.

Parameters
  • text (str) – text to be tokenized into words

  • custom_dict (marisa_trie.Trie) – a dictionary trie

Returns

list of words, tokenized from the text
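
Example (illustrative input; the segmentation assumes both words are in the default dictionary):
>>> from pythainlp.tokenize.newmm import segment
>>> segment("สวัสดีครับ")
['สวัสดี', 'ครับ']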

TCC

Thai Character Cluster

pythainlp.tokenize.tcc.segment(text: str) → List[str]

Subword segmentation.

Parameters

text (str) – text to be tokenized to character clusters

Returns

list of subwords (character clusters), tokenized from the text
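
Example (illustrative; the clusters follow the TCC rules):
>>> from pythainlp.tokenize import tcc
>>> tcc.segment("ประเทศไทย")
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']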

pythainlp.tokenize.tcc.tcc(text: str) → str

TCC generator: generates Thai Character Clusters one at a time.

Parameters

text (str) – text to be tokenized to character clusters

Returns

subword (character cluster)

pythainlp.tokenize.tcc.tcc_pos(text: str) → Set[int]

TCC positions.

Parameters

text (str) – text to be tokenized into character clusters

Returns

set of the end positions of subwords
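
Example (a hedged sketch: tcc() is described above as a generator, so it is materialized with list() here, and the positions shown assume each value is the exclusive end index of a cluster):
>>> from pythainlp.tokenize import tcc
>>> list(tcc.tcc("ประเทศไทย"))
['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']
>>> tcc.tcc_pos("ประเทศไทย")
{1, 3, 5, 6, 8, 9}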