pythainlp.tokenize

The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into the desired units.

Modules

pythainlp.tokenize.clause_tokenize(doc: List[str]) List[List[str]][source]

Clause tokenizer (clause segmentation). Tokenizes a running word list into a list of clauses (lists of strings), split by a CRF model trained on the Blackboard Treebank.

Parameters:

doc (List[str]) – word list to be segmented into clauses

Returns:

list of clauses

Return type:

list[list[str]]

Example:

Clause tokenizer:

from pythainlp.tokenize import clause_tokenize

clause_tokenize(["ฉัน", "นอน", "และ", "คุณ", "เล่น", "มือถือ",
                 "ส่วน", "น้อง", "เขียน", "โปรแกรม"])
# output:
# [['ฉัน', 'นอน'],
#  ['และ', 'คุณ', 'เล่น', 'มือถือ'],
#  ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'crfcut', keep_whitespace: bool = True) List[str][source]

Sentence tokenizer.

Tokenizes running text into “sentences”

Parameters:
  • text (str) – the text to be tokenized

  • engine (str) – choose among ‘crfcut’, ‘thaisum’, ‘tltk’, ‘whitespace’, ‘whitespace+newline’ (see Options for engine below)

Returns:

list of split sentences

Return type:

list[str]

Options for engine
  • crfcut - (default) split by CRF trained on TED dataset

  • thaisum - sentence segmenter from the work of Nakhun Chumpolsathien (2020)

  • tltk - split by TLTK

  • whitespace+newline - split by whitespaces and newlines

  • whitespace - split by whitespaces; specifically, with the regex pattern r" +"

Example:

Split the text based on whitespace:

from pythainlp.tokenize import sent_tokenize

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text based on whitespace and newline:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace+newline")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']
sent_tokenize(sentence_2, engine="whitespace+newline")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text using CRF trained on TED dataset:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="crfcut")
# output: ['ฉันไปประชุมเมื่อวันที่ 11 มีนาคม']

sent_tokenize(sentence_2, engine="crfcut")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ ',
#   'และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค']
pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc', keep_whitespace: bool = True) List[str][source]

Subword tokenizer. Subwords can be smaller than syllables.

Tokenizes text into inseparable units of contiguous Thai characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be split into smaller characters, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If these units were split further, they could no longer be spelled out. This function applies the TCC rules to tokenize the text into the smallest possible units.

For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – the name of the subword tokenizer

Returns:

list of subwords

Return type:

list[str]

Options for engine
  • dict - newmm word tokenizer with a syllable dictionary

  • etcc - Enhanced Thai Character Cluster (Inrut et al. 2001)

  • ssg - CRF syllable segmenter for Thai

  • tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)

  • tcc_p - Thai Character Cluster with the improved rules used in newmm

  • tltk - syllable tokenizer from tltk

  • wangchanberta - SentencePiece from wangchanberta model

Example:

Tokenize text into subword based on tcc:

from pythainlp.tokenize import subword_tokenize

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='tcc')
# output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
#   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
#   'ศ', '์', 'ห', 'มิ', 'ง']

subword_tokenize(text_2, engine='tcc')
# output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก',
#   'และ', 'พัฒ', 'นา', 'กา', 'ร']

Tokenize text into subword based on etcc:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='etcc')
# output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

subword_tokenize(text_2, engine='etcc')
# output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']

Tokenize text into subword based on wangchanberta:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='wangchanberta')
# output: ['▁', 'ยุค', 'เริ่มแรก', 'ของ', '▁', 'ราชวงศ์', 'หมิง']

subword_tokenize(text_2, engine='wangchanberta')
# output: ['▁ความ', 'แปลก', 'แยก', 'และ', 'พัฒนาการ']
pythainlp.tokenize.word_tokenize(text: str, custom_dict: Trie | None = None, engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True) List[str][source]

Word tokenizer.

Tokenizes running text into words (list of strings).

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – name of the tokenizer to be used

  • custom_dict (pythainlp.util.Trie) – dictionary trie

  • keep_whitespace (bool) – True to keep whitespaces, a common mark for end of phrase in Thai. Otherwise, whitespaces are omitted.

  • join_broken_num (bool) – True to rejoin formatted numeric strings (e.g. time, decimals, IP addresses) that may have been wrongly split; otherwise, leave them separated.

Returns:

list of words

Return type:

List[str]

Options for engine
  • attacut - wrapper for AttaCut, a learning-based approach

  • deepcut - wrapper for DeepCut, a learning-based approach

  • icu - wrapper for the word tokenizer in PyICU, from ICU (International Components for Unicode), dictionary-based

  • longest - dictionary-based, longest matching

  • mm - “multi-cut”, dictionary-based, maximum matching

  • nercut - dictionary-based, maximal matching, constrained with Thai Character Cluster (TCC) boundaries, combining tokens that are parts of the same named entity

  • newmm (default) - “new multi-cut”, dictionary-based, maximum matching, constrained with Thai Character Cluster (TCC) boundaries, with improved TCC rules

  • newmm-safe - newmm, with a mechanism to avoid long processing times for text with continuous ambiguous breaking points

  • nlpo3 - wrapper for the word tokenizer in nlpO3, a newmm adaptation in Rust (2.5x faster)

  • oskut - wrapper for OSKut (Out-of-domain StacKed cut for Word Segmentation)

  • sefr_cut - wrapper for SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)

  • tltk - wrapper for TLTK, maximum collocation approach

Note:
  • The custom_dict parameter only works for deepcut, longest, newmm, and newmm-safe engines.

Example:

Tokenize text with different tokenizers:

from pythainlp.tokenize import word_tokenize

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

word_tokenize(text, engine="newmm")
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

word_tokenize(text, engine='attacut')
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

Tokenize text, omitting whitespaces:

text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว "

word_tokenize(text, engine="newmm")
# output:
# ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']

word_tokenize(text, engine="newmm", keep_whitespace=False)
# output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']

Join broken formatted numeric (e.g. time, decimals, IP address):

text = "เงิน1,234บาท19:32น 127.0.0.1"

word_tokenize(text, engine="attacut", join_broken_num=False)
# output:
# ['เงิน', '1', ',', '234', 'บาท', '19', ':', '32น', ' ',
#  '127', '.', '0', '.', '0', '.', '1']

word_tokenize(text, engine="attacut", join_broken_num=True)
# output:
# ['เงิน', '1,234', 'บาท', '19:32น', ' ', '127.0.0.1']

Tokenize with default and custom dictionary:

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import dict_trie

text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

word_tokenize(text, engine="newmm")
# output:
# ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

custom_dict_japanese_name = set(thai_words())
custom_dict_japanese_name.add('ชินโซ')
custom_dict_japanese_name.add('อาเบะ')

trie = dict_trie(dict_source=custom_dict_japanese_name)

word_tokenize(text, engine="newmm", custom_dict=trie)
# output:
# ['ชินโซ', ' ', 'อาเบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']
pythainlp.tokenize.word_detokenize(segments: List[List[str]] | List[str], output: str = 'str') str | List[str][source]

Word detokenizer.

This function detokenizes the list of words in each sentence back into running text.

Parameters:
  • segments (List[List[str]] or List[str]) – a list of sentences, each given as a list of words

  • output (str) – the output type (str or list)

Returns:

the Thai text

Return type:

Union[str,List[str]]
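
Example:

A minimal usage sketch (the output shown is illustrative; exact spacing depends on the detokenization rules):

from pythainlp.tokenize import word_detokenize

# join a tokenized sentence back into running text
word_detokenize(["ผม", "เลี้ยง", "แมว", "2", "ตัว"])
# illustrative output: 'ผมเลี้ยงแมว 2 ตัว'

# per the signature, a list of tokenized sentences with output="list"
# should yield one detokenized string per sentence
word_detokenize([["ผม", "เลี้ยง", "แมว"], ["คุณ", "เล่น", "มือถือ"]], output="list")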

class pythainlp.tokenize.Tokenizer(custom_dict: Trie | Iterable[str] | str | None = None, engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)[source]

Tokenizer class, for a custom tokenizer.

This class allows users to pre-define a custom dictionary along with a tokenizer and encapsulate them into a single object. It is a wrapper for two functions: pythainlp.tokenize.word_tokenize() and pythainlp.util.dict_trie().

Example:

Tokenizer object instantiated with pythainlp.util.Trie:

from pythainlp.tokenize import Tokenizer
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

custom_words_list = set(thai_words())
custom_words_list.add('อะเฟเซีย')
custom_words_list.add('Aphasia')
trie = dict_trie(dict_source=custom_words_list)

text = "อะเฟเซีย (Aphasia*) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
_tokenizer.word_tokenize(text)
# output: ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a list of words:

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a path to a file containing a list of words separated by newlines, then explicitly setting a new tokenizer engine after initialization:

PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

# write a file
with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
    f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

# initiate an object from a file, with `attacut` as the tokenizer
_tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY, \
    engine='attacut')

_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
#   'ปกติ', 'ของ', 'การ', 'พูด']

# change tokenizer to `newmm`
_tokenizer.set_tokenize_engine(engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
#   'ปกติ', 'ของการพูด']
__init__(custom_dict: Trie | Iterable[str] | str | None = None, engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)[source]

Initialize tokenizer object.

Parameters:
  • custom_dict (str) – a file path, a list of vocabularies to be used to create a trie, or an instantiated pythainlp.util.Trie object.

  • engine (str) – choose among different tokenizer engines (e.g. newmm, mm, longest, deepcut)

  • keep_whitespace (bool) – True to keep whitespaces, a common mark for end of phrase in Thai

word_tokenize(text: str) List[str][source]

Main tokenization function.

Parameters:

text (str) – text to be tokenized

Returns:

list of words, tokenized from the text

Return type:

list[str]

set_tokenize_engine(engine: str) None[source]

Set the tokenizer’s engine.

Parameters:

engine (str) – choose among different tokenizer engines (e.g. newmm, mm, longest, deepcut)

Tokenization Engines

Sentence level

crfcut

CRFCut - Thai sentence segmenter.

Thai sentence segmentation using conditional random field, default model trained on TED dataset

Performance:
  • ORCHID - space-correct accuracy 87% vs 95% state-of-the-art

  • TED dataset - space-correct accuracy 82%

See development notebooks at https://github.com/vistec-AI/ted_crawler. POS features are not used due to the unreliability of available POS tagging.

pythainlp.tokenize.crfcut.extract_features(doc: List[str], window: int = 2, max_n_gram: int = 3) List[List[str]][source]

Extract features for CRF by sliding max_n_gram of tokens for +/- window from the current token

Parameters:
  • doc (List[str]) – tokens from which features are to be extracted

  • window (int) – size of window before and after the current token

  • max_n_gram (int) – create n_grams from 1-gram to max_n_gram-gram within the window

Returns:

list of lists of features to be fed to CRF

pythainlp.tokenize.crfcut.segment(text: str) List[str][source]

CRF-based sentence segmentation.

Parameters:

text (str) – text to be tokenized to sentences

Returns:

list of sentences, segmented from the text
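
Example:

A short sketch of calling the module directly (this engine is normally reached through sent_tokenize(engine="crfcut")); outputs are omitted because they depend on the trained model:

from pythainlp.tokenize.crfcut import extract_features, segment

text = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม และได้รับมอบหมายงานใหม่"

segment(text)
# list of sentence strings, split by the CRF model

extract_features(["ฉัน", "ไป", "ประชุม"])
# list of per-token feature lists that are fed to the CRF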

thaisumcut

The implementation of the sentence segmenter from Nakhun Chumpolsathien (2020); original code from https://github.com/nakhunchumpolsathien/ThaiSum

Cite:

@mastersthesis{chumpolsathien_2020,
  title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
  author={Chumpolsathien, Nakhun},
  year={2020},
  school={Beijing Institute of Technology}
}

ThaiSum License

Copyright 2020 Nakhun Chumpolsathien

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pythainlp.tokenize.thaisumcut.list_to_string(list: List[str]) str[source]
pythainlp.tokenize.thaisumcut.middle_cut(sentences: List[str]) List[str][source]
class pythainlp.tokenize.thaisumcut.ThaiSentenceSegmentor[source]
split_into_sentences(text: str, isMiddleCut: bool = False) List[str][source]
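
Example:

A minimal sketch of using the segmenter class directly (this engine is normally reached through sent_tokenize(engine="thaisum")):

from pythainlp.tokenize.thaisumcut import ThaiSentenceSegmentor

segmentor = ThaiSentenceSegmentor()
segmentor.split_into_sentences(
    "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม และได้พบเพื่อนเก่าหลายคน",
    isMiddleCut=False,
)
# list of sentence strings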

Word level

attacut

Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

See Also:
class pythainlp.tokenize.attacut.AttacutTokenizer(model='attacut-sc')[source]
__init__(model='attacut-sc')[source]
tokenize(text: str) List[str][source]
pythainlp.tokenize.attacut.segment(text: str, model: str = 'attacut-sc') List[str][source]

Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

Parameters:
  • text (str) – text to be tokenized to words

  • model (str) – word tokenizer model to be used

Returns:

list of words, tokenized from the text

Return type:

list[str]

Options for model

  • attacut-sc (default) using both syllable and character features

  • attacut-c using only character feature
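
Example:

A minimal sketch, assuming the attacut package is installed:

from pythainlp.tokenize.attacut import AttacutTokenizer, segment

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

segment(text)                     # default "attacut-sc" model
segment(text, model="attacut-c")  # character-feature-only model

_tokenizer = AttacutTokenizer(model="attacut-sc")
_tokenizer.tokenize(text)
# should match word_tokenize(text, engine="attacut") shown above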


deepcut

Wrapper for deepcut Thai word segmentation. deepcut is a Thai word segmentation library using a 1D convolutional neural network.

Users need to install deepcut (and its dependency, tensorflow) themselves.

See Also:
pythainlp.tokenize.deepcut.segment(text: str, custom_dict: Trie | List[str] | str | None = None) List[str][source]
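
Example:

A minimal sketch, assuming deepcut and tensorflow are already installed:

from pythainlp.tokenize.deepcut import segment

segment("ตัดคำได้ดีมาก")
# list of words from the convolutional model

# custom_dict may be a Trie, a list of words, or a file path
segment("ตัดคำได้ดีมาก", custom_dict=["ตัดคำ"])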

multi_cut

Multi cut – Thai word segmentation with maximum matching. Original code from Korakot Chaovavanich.

See Also:
class pythainlp.tokenize.multi_cut.LatticeString(value, multi=None, in_dict=True)[source]

String that keeps possible tokenizations

__init__(value, multi=None, in_dict=True)[source]
pythainlp.tokenize.multi_cut.mmcut(text: str) List[str][source]
pythainlp.tokenize.multi_cut.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Dictionary-based maximum matching word segmentation.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

Returns:

list of segmented tokens

Return type:

List[str]

pythainlp.tokenize.multi_cut.find_all_segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Get all possible segment variations.

Parameters:
  • text (str) – input string to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

Returns:

list of segment variations

Return type:

List[str]
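
Example:

A minimal sketch; outputs are omitted since they depend on the dictionary in use:

from pythainlp.tokenize.multi_cut import find_all_segment, segment

text = "ผมรักภาษาบ้านเกิด"

segment(text)           # one maximum-matching segmentation
find_all_segment(text)  # every segmentation variation the dictionary allows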

nlpo3

pythainlp.tokenize.nlpo3.load_dict(file_path: str, dict_name: str) bool[source]

Load a dictionary file into an in-memory dictionary collection.

The loaded dictionary will be accessible through the assigned dict_name. This function does not override an existing dictionary name.

Parameters:
  • file_path (str) – Path to a dictionary file

  • dict_name (str) – a unique dictionary name, used for reference

Returns:

bool

See Also:
pythainlp.tokenize.nlpo3.segment(text: str, custom_dict: str = '_67a47bf9', safe_mode: bool = False, parallel_mode: bool = False) List[str][source]

Break text into tokens.

Python binding for nlpO3, a newmm engine implemented in Rust.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (str) – dictionary name, as assigned with load_dict(), defaults to pythainlp/corpus/common/words_th.txt

  • safe_mode (bool) – reduce chance for long processing time in long text with many ambiguous breaking points, defaults to False

  • parallel_mode (bool) – Use multithread mode, defaults to False

Returns:

list of tokens

Return type:

List[str]
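
Example:

A minimal sketch; the dictionary file path below is a placeholder, not part of the library:

from pythainlp.tokenize.nlpo3 import load_dict, segment

# register a word-list file under a dictionary name of our choosing
load_dict("/path/to/my_words.txt", "my_dict")

segment("สวัสดีครับ", custom_dict="my_dict", safe_mode=True)
# list of tokens

segment("สวัสดีครับ")
# uses the default dictionary (pythainlp/corpus/common/words_th.txt)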

longest

Dictionary-based longest-matching Thai word segmentation. Implementation based on the code from Patorn Utenpattanun.

See Also:
class pythainlp.tokenize.longest.LongestMatchTokenizer(trie: Trie)[source]
__init__(trie: Trie)[source]
tokenize(text: str) List[str][source]
pythainlp.tokenize.longest.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Dictionary-based longest matching word segmentation.

Parameters:
  • text (str) – text to be tokenized to words

  • custom_dict (pythainlp.util.Trie) – dictionary for tokenization

Returns:

list of words, tokenized from the text
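
Example:

A minimal sketch, using pythainlp.util.dict_trie to build a custom dictionary:

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize.longest import segment
from pythainlp.util import dict_trie

text = "อะเฟเซียเป็นอาการผิดปกติของการพูด"

segment(text)  # longest matching with the default dictionary

# longest matching with a custom dictionary
words = set(thai_words())
words.add("อะเฟเซีย")
segment(text, custom_dict=dict_trie(dict_source=words))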

pyicu

Wrapper for PyICU word segmentation. This wrapper module uses icu.BreakIterator with a Thai icu.Locale to locate word boundaries in the text.

See Also:
pythainlp.tokenize.pyicu.segment(text: str) List[str][source]
Parameters:

text (str) – text to be tokenized to words

Returns:

list of words, tokenized from the text
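
Example:

A minimal sketch, assuming PyICU is installed:

from pythainlp.tokenize.pyicu import segment

segment("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย")
# list of words located by ICU's BreakIterator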

nercut

nercut 0.2

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Code by Wannaphong Phatthiyaphaibun

pythainlp.tokenize.nercut.segment(text: str, taglist: ~typing.Iterable[str] = ['ORGANIZATION', 'PERSON', 'PHONE', 'EMAIL', 'DATE', 'TIME'], tagger=<pythainlp.tag.named_entity.NER object>) List[str][source]

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Parameters:
  • text (str) – text to be tokenized to words

  • taglist (Iterable[str]) – a list of named-entity tags to be used

  • tagger – NER tagger engine

Returns:

list of words, tokenized from the text
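
Example:

A minimal sketch; the NER tagger model is loaded on first use, so the output is omitted here:

from pythainlp.tokenize.nercut import segment

segment("คุณสมชาย ใจดี โทร 091-234-5678")
# tokens belonging to the same named entity (here, a PERSON name and
# a PHONE number) should be merged into single tokens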

sefr_cut

Wrapper for SEFR CUT Thai word segmentation. SEFR CUT is a Thai word segmentation model using a stacked ensemble.

See Also:
pythainlp.tokenize.sefr_cut.segment(text: str, engine: str = 'ws1000') List[str][source]

oskut

Wrapper for OSKut (Out-of-domain StacKed cut for Word Segmentation): Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation, a stacked-ensemble framework with DeepCut as the baseline model (ACL 2021 Findings).

See Also:
pythainlp.tokenize.oskut.segment(text: str, engine: str = 'ws') List[str][source]

newmm

The default word tokenization engine.

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, with improved rules.

The code is based on the notebooks created by Korakot Chaovavanich, with heuristic graph size limit added to avoid exponential wait time.

See Also:
pythainlp.tokenize.newmm.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, safe_mode: bool = False) List[str][source]

Maximal-matching word segmentation, Thai Character Cluster constrained.

A dictionary-based word segmentation using maximal matching algorithm, constrained to Thai Character Cluster boundaries.

A custom dictionary can be supplied.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

  • safe_mode (bool, optional) – reduce chance for long processing time in long text with many ambiguous breaking points, defaults to False

Returns:

list of tokens

Return type:

List[str]
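
Example:

A minimal sketch; word_tokenize() above is the usual entry point for this engine:

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize.newmm import segment
from pythainlp.util import dict_trie

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

segment(text)
# as in the word_tokenize example above:
# ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

segment(text, safe_mode=True)  # guards against texts with many ambiguous breaking points

# with a custom dictionary
words = set(thai_words())
words.add("ภาษาบ้านเกิด")
segment(text, custom_dict=dict_trie(dict_source=words))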

Subword level

tcc

The implementation of a tokenizer according to the Thai Character Cluster (TCC) rules proposed by Theeramunkong et al. (2000).

Credits:
pythainlp.tokenize.tcc.tcc(text: str) str[source]

TCC generator, generates Thai Character Clusters

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

subwords (character clusters)

Return type:

Iterator[str]

pythainlp.tokenize.tcc.tcc_pos(text: str) Set[int][source]

TCC positions

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

set of the end positions of subwords

Return type:

set[int]

pythainlp.tokenize.tcc.segment(text: str) List[str][source]

Subword segmentation

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

list of subwords (character clusters), tokenized from the text

Return type:

list[str]
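
Example:

A minimal sketch; the segment() result should mirror the subword_tokenize(engine='tcc') example above:

from pythainlp.tokenize.tcc import segment, tcc, tcc_pos

text = "ความแปลกแยกและพัฒนาการ"

segment(text)
# list of character clusters, as in the subword_tokenize example above

tcc_pos(text)
# set of end positions (character indices) of the clusters

for cluster in tcc(text):  # generator form
    print(cluster)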

tcc_p

The implementation of a tokenizer based on Thai Character Cluster (TCC) rules with the improved rules used in newmm (module pythainlp.tokenize.tcc_p).

pythainlp.tokenize.tcc_p.segment(text: str) List[str][source]

Subword segmentation

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

list of subwords (character clusters), tokenized from the text

Return type:

list[str]

pythainlp.tokenize.tcc_p.tcc(text: str) str[source]

TCC generator, generates Thai Character Clusters

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

subwords (character clusters)

Return type:

Iterator[str]

pythainlp.tokenize.tcc_p.tcc_pos(text: str) Set[int][source]

TCC positions

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

set of the end positions of subwords

Return type:

set[int]

etcc

Segmenting text into Enhanced Thai Character Clusters (ETCCs). Python implementation by Wannaphong Phatthiyaphaibun.

This implementation relies on a dictionary of ETCC created from etcc.txt in pythainlp/corpus.

Notebook: https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ

See Also:

Jeeragone Inrut, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

pythainlp.tokenize.etcc.segment(text: str) List[str][source]

Segmenting text into ETCCs.

Enhanced Thai Character Cluster (ETCC) is a kind of subword unit. The concept was presented in Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

Parameters:

text (str) – text to be tokenized to character clusters

Returns:

list of clusters, tokenized from the text

Return type:

list[str]
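
Example:

A minimal sketch; the result should match the subword_tokenize(engine='etcc') example above:

from pythainlp.tokenize.etcc import segment

segment("ความแปลกแยกและพัฒนาการ")
# expected (per the example above): ['ความแปลกแยกและ', 'พัฒ', 'นาการ']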
