pythainlp.tokenize

The pythainlp.tokenize module contains functions and classes for tokenizing Thai text into units such as paragraphs, sentences, clauses, words, syllables, and subwords. It is a fundamental component of the PyThaiNLP library.

Modules

pythainlp.tokenize.clause_tokenize(doc: List[str]) List[List[str]][source]

Clause tokenizer (clause segmentation). Tokenizes a running word list into a list of clauses, where each clause is a list of strings. Clauses are split by a CRF trained on Blackboard Treebank.

Parameters:

doc (List[str]) – word list to be clause tokenized

Returns:

list of clauses

Return type:

list[list[str]]

Example:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน'],
# ['และ', 'คุณ', 'เล่น', 'มือถือ'],
# ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

Splits an already word-tokenized text into clauses, which is useful as a preprocessing step for more advanced text processing tasks.

pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'crfcut', keep_whitespace: bool = True) List[str][source]

Sentence tokenizer.

Tokenizes running text into “sentences”.

Parameters:
  • text (str) – the text to be tokenized

  • engine (str) – name of the sentence tokenizer (see the options for engine below)

Returns:

list of split sentences

Return type:

list[str]

Options for engine
  • crfcut - (default) split by CRF trained on TED dataset

  • thaisum - The implementation of sentence segmenter from Nakhun Chumpolsathien, 2020

  • tltk - split by TLTK

  • wtp - split by wtpsplitaxe. It supports several model sizes: wtp uses the mini model, wtp-tiny uses the wtp-bert-tiny model, wtp-mini uses the wtp-bert-mini model, wtp-base uses the wtp-canine-s-1l model, and wtp-large uses the wtp-canine-s-12l model.

  • whitespace+newline - split by whitespace and newline.

  • whitespace - split by whitespace, specifically with regex pattern r" +"

Example:

Split the text based on whitespace:

from pythainlp.tokenize import sent_tokenize

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   'และได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text based on whitespace and newline:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace+newline")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']
sent_tokenize(sentence_2, engine="whitespace+newline")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   'และได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text using CRF trained on TED dataset:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="crfcut")
# output: ['ฉันไปประชุมเมื่อวันที่ 11 มีนาคม']

sent_tokenize(sentence_2, engine="crfcut")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ ',
#   'และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Splits Thai text into sentences. This function identifies sentence boundaries, which is essential for text segmentation and analysis.
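The two whitespace engines differ only in which characters count as a separator. The following is a minimal sketch, not the library's code: it assumes the whitespace engine applies the regex r" +" named in the options list, and it assumes (this is not confirmed by the text above) that whitespace+newline behaves like Python's str.split(). Both helper names are hypothetical.

```python
import re
from typing import List


def split_on_spaces(text: str) -> List[str]:
    """Split on runs of the space character only; newlines stay in the pieces."""
    return re.split(r" +", text)


def split_on_spaces_and_newlines(text: str) -> List[str]:
    """Split on any run of whitespace, which includes newlines."""
    return text.split()


sample = "ฉันไปประชุม  11 มีนาคม\nวันนี้ฝนตก"

print(split_on_spaces(sample))
# ['ฉันไปประชุม', '11', 'มีนาคม\nวันนี้ฝนตก']

print(split_on_spaces_and_newlines(sample))
# ['ฉันไปประชุม', '11', 'มีนาคม', 'วันนี้ฝนตก']
```

Neither variant looks at the content of the text, so both are better described as chunkers than as sentence segmenters; the model-based engines are the ones that decide whether a space actually ends a sentence.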

pythainlp.tokenize.paragraph_tokenize(text: str, engine: str = 'wtp-mini', paragraph_threshold: float = 0.5, style: str = 'newline') List[List[str]][source]

Paragraph tokenizer.

Tokenizes text into paragraphs.

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – the name of paragraph tokenizer

Returns:

list of paragraphs

Return type:

List[List[str]]

Options for engine
  • wtp - split by wtpsplitaxe. It supports several model sizes: wtp uses the mini model, wtp-tiny uses the wtp-bert-tiny model, wtp-mini uses the wtp-bert-mini model (the default engine in the signature above), wtp-base uses the wtp-canine-s-1l model, and wtp-large uses the wtp-canine-s-12l model.

Example:

Split the text based on wtp:

from pythainlp.tokenize import paragraph_tokenize

sent = (
    "(1) บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต"
    +"  มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด"
    +" จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้"
)

paragraph_tokenize(sent)
# output: [
# ['(1) '], 
# [
#   'บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#   'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#   'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#   'ณ ที่นี้'
# ]]

Segments text into paragraphs, which can be valuable for document-level analysis or summarization.

pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc', keep_whitespace: bool = True) List[str][source]

Subword tokenizer for tokenizing text into units smaller than syllables.

Tokenizes text into inseparable units of contiguous Thai characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be split any further, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If such a unit were separated, its parts could not be spelled out. This function applies TCC rules to tokenize the text into the smallest units.

For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – the name of subword tokenizer

  • keep_whitespace (bool) – keep whitespace

Returns:

list of subwords

Return type:

List[str]

Options for engine
  • dict - newmm word tokenizer with a syllable dictionary

  • etcc - Enhanced Thai Character Cluster (Inrut et al. 2001)

  • han_solo - CRF syllable segmenter for Thai that can work in the Thai social media domain. See PyThaiNLP/Han-solo.

  • ssg - CRF syllable segmenter for Thai. See ponrawee/ssg.

  • tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)

  • tcc_p - Thai Character Cluster + improved rules that are used in newmm

  • tltk - syllable tokenizer from tltk. See tltk.

  • wangchanberta - SentencePiece from wangchanberta model

Example:

Tokenize text into subwords based on tcc:

from pythainlp.tokenize import subword_tokenize

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='tcc')
# output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
#   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
#   'ศ', '์', 'ห', 'มิ', 'ง']

subword_tokenize(text_2, engine='tcc')
# output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก',
#   'และ', 'พัฒ', 'นา', 'กา', 'ร']

Tokenize text into subwords based on etcc:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='etcc')
# output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

subword_tokenize(text_2, engine='etcc')
# output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']

Tokenize text into subwords based on wangchanberta:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='wangchanberta')
# output: ['▁', 'ยุค', 'เริ่มแรก', 'ของ', '▁', 'ราชวงศ์', 'หมิง']

subword_tokenize(text_2, engine='wangchanberta')
# output: ['▁ความ', 'แปลก', 'แยก', 'และ', 'พัฒนาการ']

Tokenizes text into subwords, which can be helpful for various NLP tasks, including subword embeddings.

pythainlp.tokenize.syllable_tokenize(text: str, engine: str = 'han_solo', keep_whitespace: bool = True) List[str][source]

Syllable tokenizer.

Tokenizes text into inseparable units of Thai syllables.

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – the name of syllable tokenizer

  • keep_whitespace (bool) – keep whitespace

Returns:

list of syllables

Return type:

List[str]

Options for engine
  • dict - newmm word tokenizer with a syllable dictionary

  • han_solo - CRF syllable segmenter for Thai that can work in the Thai social media domain. See PyThaiNLP/Han-solo.

  • ssg - CRF syllable segmenter for Thai. See ponrawee/ssg.

  • tltk - syllable tokenizer from tltk. See tltk.

Divides text into syllables, allowing you to work with individual Thai language phonetic units.

pythainlp.tokenize.word_tokenize(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True) List[str][source]

Word tokenizer.

Tokenizes running text into words (list of strings).

Parameters:
  • text (str) – text to be tokenized

  • engine (str) – name of the tokenizer to be used

  • custom_dict (pythainlp.util.Trie) – dictionary trie (some engines may not support it)

  • keep_whitespace (bool) – True to keep whitespace, a common mark for end of phrase in Thai. Otherwise, whitespace is omitted.

  • join_broken_num (bool) – True to rejoin formatted numbers (e.g. time, decimals, IP addresses) that the tokenizer may have split apart. Otherwise, they are left as the tokenizer produced them.

Returns:

list of words

Return type:

List[str]

Options for engine
  • attacut - wrapper for AttaCut, learning-based approach

  • deepcut - wrapper for DeepCut, learning-based approach

  • icu - wrapper for a word tokenizer in PyICU, from ICU (International Components for Unicode), dictionary-based

  • longest - dictionary-based, longest matching

  • mm - “multi-cut”, dictionary-based, maximum matching

  • nercut - dictionary-based, maximal matching, constrained by Thai Character Cluster (TCC) boundaries, combining tokens that are parts of the same named-entity

  • newmm (default) - “new multi-cut”, dictionary-based, maximum matching, constrained by Thai Character Cluster (TCC) boundaries with improved TCC rules that are used in newmm.

  • newmm-safe - newmm, with a mechanism to avoid long processing time for text with continuously ambiguous breaking points

  • nlpo3 - wrapper for a word tokenizer in nlpO3, an adaptation of newmm in Rust (2.5x faster)

  • oskut - wrapper for OSKut, Out-of-domain StacKed cut for Word Segmentation

  • sefr_cut - wrapper for SEFR CUT, Stacked Ensemble Filter and Refine for Word Segmentation

  • tltk - wrapper for TLTK, maximum collocation approach

Note:
  • The custom_dict parameter only works for deepcut, longest, newmm, and newmm-safe engines.

Example:

Tokenize text with different tokenizers:

from pythainlp.tokenize import word_tokenize

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

word_tokenize(text, engine="newmm")
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

word_tokenize(text, engine='attacut')
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

Tokenize text with whitespace omitted:

text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว "

word_tokenize(text, engine="newmm")
# output:
# ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']

word_tokenize(text, engine="newmm", keep_whitespace=False)
# output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']

Join broken formatted numeric (e.g. time, decimals, IP addresses):

text = "เงิน1,234บาท19:32น 127.0.0.1"

word_tokenize(text, engine="attacut", join_broken_num=False)
# output:
# ['เงิน', '1', ',', '234', 'บาท', '19', ':', '32น', ' ',
#  '127', '.', '0', '.', '0', '.', '1']

word_tokenize(text, engine="attacut", join_broken_num=True)
# output:
# ['เงิน', '1,234', 'บาท', '19:32น', ' ', '127.0.0.1']
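To make the effect of join_broken_num concrete, here is a minimal sketch of one way to rejoin digit groups that a tokenizer has split at separators. It is an illustration under assumptions, not PyThaiNLP's implementation: the function name, the separator set, and the token-merging strategy are all hypothetical, and it does not handle suffixes such as the 'น' in '19:32น'.

```python
import re
from typing import List

_DIGITS = re.compile(r"\d+")
_SEPARATORS = {",", ".", ":"}  # assumed separator set, for illustration only


def rejoin_formatted_numbers(tokens: List[str]) -> List[str]:
    """Merge runs of digits / separator / digits back into a single token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if _DIGITS.fullmatch(current):
            # Absorb every "<separator><digits>" pair that directly follows.
            while (
                i + 2 < len(tokens)
                and tokens[i + 1] in _SEPARATORS
                and _DIGITS.fullmatch(tokens[i + 2])
            ):
                current += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(current)
        i += 1
    return merged


broken = ['เงิน', '1', ',', '234', 'บาท', ' ', '127', '.', '0', '.', '0', '.', '1']
print(rejoin_formatted_numbers(broken))
# ['เงิน', '1,234', 'บาท', ' ', '127.0.0.1']
```

A separator that is not followed by digits is left alone, so ordinary punctuation after a number is not swallowed.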

Tokenize with default and custom dictionaries:

from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

word_tokenize(text, engine="newmm")
# output:
# ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

custom_dict_japanese_name = set(thai_words())
custom_dict_japanese_name.add('ชินโซ')
custom_dict_japanese_name.add('อาเบะ')

trie = dict_trie(dict_source=custom_dict_japanese_name)

word_tokenize(text, engine="newmm", custom_dict=trie)
# output:
# ['ชินโซ', ' ', 'อาเบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

Splits text into words. This function is a fundamental tool for Thai language text analysis.

pythainlp.tokenize.word_detokenize(segments: List[List[str]] | List[str], output: str = 'str') List[str] | str[source]

Word detokenizer.

This function will detokenize the list of words in each sentence into text.

Parameters:
  • segments (List[List[str]] | List[str]) – list of sentences, each a list of words, or a single list of words

  • output (str) – the output type (str or list)

Returns:

the Thai text

Return type:

Union[str,List[str]]

Example:

from pythainlp.tokenize import word_detokenize
print(word_detokenize(["เรา", "เล่น"]))
# output: เราเล่น

Reverses the tokenization process, reconstructing text from tokenized units. Useful for text generation tasks.

class pythainlp.tokenize.Tokenizer(custom_dict: Trie | Iterable[str] | str = [], engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)[source]

Tokenizer class for a custom tokenizer.

This class lets users predefine a custom dictionary together with a tokenizer engine and encapsulate them in a single object. It is a wrapper around two functions: pythainlp.tokenize.word_tokenize() and pythainlp.util.dict_trie().

Example:

Tokenizer object instantiated with pythainlp.util.Trie:

from pythainlp.tokenize import Tokenizer
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

custom_words_list = set(thai_words())
custom_words_list.add('อะเฟเซีย')
custom_words_list.add('Aphasia')
trie = dict_trie(dict_source=custom_words_list)

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
_tokenizer.word_tokenize(text)
# output: ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a list of words:

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a file path containing a list of words separated with newline and explicitly setting a new tokenizer after initiation:

PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

# write a file
with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
    f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

# initiate an object from file with `attacut` as tokenizer
_tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY, \
    engine='attacut')

_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
#   'ปกติ', 'ของ', 'การ', 'พูด']

# change tokenizer to `newmm`
_tokenizer.set_tokenize_engine(engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
#   'ปกติ', 'ของการพูด']

The Tokenizer class bundles a custom dictionary with a tokenizer engine so that both can be reused across calls. Its methods are documented below.

__init__(custom_dict: Trie | Iterable[str] | str = [], engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)[source]

Initialize tokenizer object.

Parameters:
  • custom_dict (Trie | Iterable[str] | str) – a file path, a list of vocabularies to be used to create a trie, or an instantiated pythainlp.util.Trie object

  • engine (str) – name of the tokenizer engine to use (e.g. newmm, mm, longest, deepcut)

  • keep_whitespace (bool) – True to keep whitespace, a common mark for end of phrase in Thai

  • join_broken_num (bool) – True to rejoin formatted numbers that could be wrongly separated
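Whatever form custom_dict takes, the dictionary-based engines end up querying a trie for the dictionary words that start at a given position. The sketch below shows that idea with a plain nested-dict trie. It is an illustration only: SimpleTrie and its prefixes method are hypothetical stand-ins and make no claim about how pythainlp.util.Trie is implemented.

```python
from typing import Dict, Iterable, List


class SimpleTrie:
    """Toy trie: nested dicts keyed by character, "" marks the end of a word."""

    def __init__(self, words: Iterable[str]) -> None:
        self._root: Dict = {}
        for word in words:
            node = self._root
            for ch in word:
                node = node.setdefault(ch, {})
            node[""] = True  # "" can never collide with a one-character key

    def prefixes(self, text: str) -> List[str]:
        """Return every dictionary word that is a prefix of text, shortest first."""
        found: List[str] = []
        node = self._root
        for i, ch in enumerate(text):
            node = node.get(ch)
            if node is None:
                break
            if "" in node:
                found.append(text[: i + 1])
        return found


trie = SimpleTrie(["ภาษา", "ภาษาไทย", "ไทย"])
print(trie.prefixes("ภาษาไทยสนุก"))
# ['ภาษา', 'ภาษาไทย']
```

A single walk over the input yields all candidate words at that position, which is what makes a trie a better fit for segmentation than testing every substring against a set.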

word_tokenize(text: str) List[str][source]

Main tokenization function.

Parameters:

text (str) – text to be tokenized

Returns:

list of words, tokenized from the text

Return type:

list[str]

set_tokenize_engine(engine: str) None[source]

Set the tokenizer’s engine.

Parameters:

engine (str) – name of the tokenizer engine to use (e.g. newmm, mm, longest, deepcut)

Tokenization Engines

This module offers multiple tokenization engines designed for different levels of text analysis.

Sentence level

crfcut

CRFCut - Thai sentence segmenter.

Thai sentence segmentation using a conditional random field, with the default model trained on the TED dataset.

Performance:

  • ORCHID - space-correct accuracy 87% vs 95% state-of-the-art

  • TED dataset - space-correct accuracy 82%

See development notebooks at https://github.com/vistec-AI/ted_crawler. POS features are not used because the available POS tagging is unreliable.

A tokenizer that operates at the sentence level using Conditional Random Fields (CRF).

pythainlp.tokenize.crfcut.extract_features(doc: List[str], window: int = 2, max_n_gram: int = 3) List[List[str]][source]

Extract features for CRF by sliding max_n_gram of tokens for +/- window from the current token

Parameters:
  • doc (List[str]) – tokens from which features are to be extracted

  • window (int) – size of window before and after the current token

  • max_n_gram (int) – create n_grams from 1-gram to max_n_gram-gram within the window

Returns:

list of lists of features to be fed to CRF
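The description above (n-grams of size 1 to max_n_gram, taken within +/- window tokens of the current token) can be sketched as follows. This is a guess at the general shape, not the library's code: the function name, the "<pad>" padding token, and the feature string format are assumptions made for illustration.

```python
from typing import List


def window_ngram_features(
    doc: List[str], window: int = 2, max_n_gram: int = 3
) -> List[List[str]]:
    """For each token, list every n-gram that fits inside its +/- window."""
    padded = ["<pad>"] * window + doc + ["<pad>"] * window
    features: List[List[str]] = []
    for i in range(len(doc)):
        center = i + window  # index of the current token inside `padded`
        token_features: List[str] = []
        for n in range(1, max_n_gram + 1):
            # An n-gram starting at `offset` must end at or before +window.
            for offset in range(-window, window - n + 2):
                gram = padded[center + offset : center + offset + n]
                token_features.append(f"{n}gram[{offset}]=" + "|".join(gram))
        features.append(token_features)
    return features


feats = window_ngram_features(["a", "b", "c"], window=1, max_n_gram=2)
print(feats[1])
# ['1gram[-1]=a', '1gram[0]=b', '1gram[1]=c', '2gram[-1]=a|b', '2gram[0]=b|c']
```

Encoding the relative offset in the feature name lets the CRF learn that the same word matters differently before and after a candidate boundary.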

pythainlp.tokenize.crfcut.segment(text: str) List[str][source]

CRF-based sentence segmentation.

Parameters:

text (str) – text to be tokenized into sentences

Returns:

list of sentences, tokenized from the text

thaisumcut

The implementation of the sentence segmenter from Nakhun Chumpolsathien, 2020. The original code is from https://github.com/nakhunchumpolsathien/ThaiSum

Cite:

@mastersthesis{chumpolsathien_2020,
    title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
    author={Chumpolsathien, Nakhun},
    year={2020},
    school={Beijing Institute of Technology}
}

A sentence tokenizer based on a maximum entropy model. It’s a great choice for sentence boundary detection in Thai text.

pythainlp.tokenize.thaisumcut.list_to_string(list: List[str]) str[source]
pythainlp.tokenize.thaisumcut.middle_cut(sentences: List[str]) List[str][source]
class pythainlp.tokenize.thaisumcut.ThaiSentenceSegmentor[source]
split_into_sentences(text: str, isMiddleCut: bool = False) List[str][source]

Word level

attacut

Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

A tokenizer designed for word-level segmentation. It provides accurate word boundary detection in Thai text.

class pythainlp.tokenize.attacut.AttacutTokenizer(model='attacut-sc')[source]
__init__(model='attacut-sc')[source]
tokenize(text: str) List[str][source]
pythainlp.tokenize.attacut.segment(text: str, model: str = 'attacut-sc') List[str][source]

Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

Parameters:
  • text (str) – text to be tokenized into words

  • model (str) – name of the word tokenizer model

Returns:

list of words, tokenized from the text

Return type:

list[str]

Options for model
  • attacut-sc (default) using both syllable and character features

  • attacut-c using only character feature

deepcut

Wrapper for deepcut Thai word segmentation. deepcut is a Thai word segmentation library using a 1D convolutional neural network.

Users need to install deepcut (and its dependency, tensorflow) themselves.

A learning-based word tokenizer.

pythainlp.tokenize.deepcut.segment(text: str, custom_dict: Trie | List[str] | str = []) List[str][source]

multi_cut

Multi cut – Thai word segmentation with maximum matching. Original code by Korakot Chaovavanich.

A dictionary-based tokenizer that uses maximum matching and can also list every possible segmentation of the input.

class pythainlp.tokenize.multi_cut.LatticeString(value, multi=None, in_dict=True)[source]

String that keeps possible tokenizations

__init__(value, multi=None, in_dict=True)[source]
pythainlp.tokenize.multi_cut.mmcut(text: str) List[str][source]
pythainlp.tokenize.multi_cut.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Dictionary-based maximum matching word segmentation.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

Returns:

list of segmented tokens

Return type:

List[str]

pythainlp.tokenize.multi_cut.find_all_segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Get all possible segment variations.

Parameters:
  • text (str) – input string to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

Returns:

list of segment variations

Return type:

List[str]
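The idea of "all possible segment variations" can be illustrated with a short recursive enumeration over a dictionary. This is a sketch under assumptions rather than the multi_cut code: the function name is hypothetical, a plain set stands in for the Trie, and each variation is returned as a list of tokens, which may differ from the format find_all_segment actually produces.

```python
from functools import lru_cache
from typing import Iterable, List, Tuple


def all_segmentations(text: str, words: Iterable[str]) -> List[List[str]]:
    """Enumerate every way to cover `text` completely with dictionary words."""
    vocabulary = frozenset(words)

    @lru_cache(maxsize=None)
    def split_from(start: int) -> Tuple[Tuple[str, ...], ...]:
        if start == len(text):
            return ((),)  # exactly one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocabulary:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(variation) for variation in split_from(0)]


print(all_segmentations("ผลไม้", ["ผล", "ไม้", "ผลไม้"]))
# [['ผล', 'ไม้'], ['ผลไม้']]
```

The number of variations can grow exponentially with the length of an ambiguous text, which is why a practical tokenizer selects one path instead of materializing all of them.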

nlpo3

A wrapper for nlpO3, an implementation of the newmm word tokenizer in Rust.

pythainlp.tokenize.nlpo3.load_dict(file_path: str, dict_name: str) bool[source]

Load a dictionary file into an in-memory dictionary collection.

The loaded dictionary will be accessible through the assigned dict_name. This function does not override an existing dict name.

Parameters:
  • file_path (str) – Path to a dictionary file

  • dict_name (str) – A unique dictionary name, used for reference.

Return type:

bool

pythainlp.tokenize.nlpo3.segment(text: str, custom_dict: str = '_67a47bf9', safe_mode: bool = False, parallel_mode: bool = False) List[str][source]

Break text into tokens.

Python binding for nlpO3. It is newmm engine in Rust.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (str) – dictionary name, as assigned with load_dict(), defaults to pythainlp/corpus/common/words_th.txt

  • safe_mode (bool) – reduce chance for long processing time for long text with many ambiguous breaking points, defaults to False

  • parallel_mode (bool) – Use multithread mode, defaults to False

Returns:

list of tokens

Return type:

List[str]

longest

Dictionary-based longest-matching Thai word segmentation. The implementation is based on code from Patorn Utenpattanun.

A tokenizer that identifies word boundaries by selecting the longest possible words in a text.

class pythainlp.tokenize.longest.LongestMatchTokenizer(trie: Trie)[source]
__init__(trie: Trie)[source]
tokenize(text: str) List[str][source]
pythainlp.tokenize.longest.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) List[str][source]

Dictionary-based longest matching word segmentation.

Parameters:
  • text (str) – text to be tokenized into words

  • custom_dict (pythainlp.util.Trie) – dictionary for tokenization

Returns:

list of words, tokenized from the text
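Longest matching is a greedy strategy: at each position, take the longest dictionary word that starts there, then continue after it. The sketch below shows the bare technique with a plain set as the dictionary. It is not the LongestMatchTokenizer code, the function name is hypothetical, and falling back to a single character for unknown text is an assumption made to keep the example total.

```python
from typing import Iterable, List


def longest_match_segment(text: str, words: Iterable[str]) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary hit."""
    vocabulary = set(words)
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens: List[str] = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in vocabulary:
                break
        else:
            end = start + 1  # no hit: emit the single character on its own
        tokens.append(text[start:end])
        start = end
    return tokens


words = ["ฉัน", "รัก", "ภาษา", "ภาษาไทย", "ไทย"]
print(longest_match_segment("ฉันรักภาษาไทย", words))
# ['ฉัน', 'รัก', 'ภาษาไทย']
```

Because each choice is final, an early long match can force a poor split later in the text; maximal matching engines address this by comparing whole segmentations.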

pyicu

Wrapper for PyICU word segmentation. This wrapper module uses icu.BreakIterator with Thai as icu.Locale to locate boundaries between words in the text.

An ICU-based word tokenizer for Thai text segmentation.

pythainlp.tokenize.pyicu.segment(text: str) List[str][source]
Parameters:

text (str) – text to be tokenized into words

Returns:

list of words, tokenized from the text

nercut

nercut 0.2

Dictionary-based maximal matching word segmentation, constrained by Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named entity.

Code by Wannaphong Phatthiyaphaibun

A tokenizer that keeps the tokens of a named entity together, which suits Named Entity Recognition (NER) pipelines.

pythainlp.tokenize.nercut.segment(text: str, taglist: ~typing.Iterable[str] = ['ORGANIZATION', 'PERSON', 'PHONE', 'EMAIL', 'DATE', 'TIME'], tagger=<pythainlp.tag.named_entity.NER object>) List[str][source]

Dictionary-based maximal matching word segmentation, constrained by Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Parameters:
  • text (str) – text to be tokenized into words

  • taglist (list) – a list of named entity tags to be used

  • tagger (class) – NER tagger engine

Returns:

list of words, tokenized from the text
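The "combining tokens that are parts of the same named entity" step can be sketched independently of any tagger. The example below assumes the tagger yields (token, tag) pairs in B-/I-/O style, which is an assumption for illustration; the function name and the sample tags are hypothetical, and the real nercut module obtains its tags from the tagger argument.

```python
from typing import Iterable, List, Optional, Tuple


def merge_entity_tokens(
    tagged: List[Tuple[str, str]], taglist: Iterable[str]
) -> List[str]:
    """Glue together consecutive tokens that belong to one wanted entity."""
    wanted = set(taglist)
    merged: List[str] = []
    current_kind: Optional[str] = None  # entity type currently being extended
    for token, tag in tagged:
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and kind == current_kind:
            merged[-1] += token  # continue the entity opened by a B- tag
        else:
            merged.append(token)
            current_kind = kind if prefix == "B" and kind in wanted else None
    return merged


tagged = [
    ("สมชาย", "B-PERSON"), (" ", "I-PERSON"), ("ใจดี", "I-PERSON"),
    ("ไป", "O"),
    ("กรุงเทพ", "B-LOCATION"), ("มหานคร", "I-LOCATION"),
]
print(merge_entity_tokens(tagged, ["PERSON"]))
# ['สมชาย ใจดี', 'ไป', 'กรุงเทพ', 'มหานคร']
```

Entity types that are not in taglist are left split, which mirrors the role of the taglist parameter described above.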

sefr_cut

Wrapper for SEFR CUT Thai word segmentation. SEFR CUT is a set of Thai word segmentation models that use a stacked ensemble.

A word tokenizer for Thai text based on a stacked ensemble filter-and-refine approach.

pythainlp.tokenize.sefr_cut.segment(text: str, engine: str = 'ws1000') List[str][source]

oskut

Wrapper for OSKut (Out-of-domain StacKed cut for Word Segmentation), from “Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation” (ACL 2021 Findings), which uses a stacked ensemble framework with DeepCut as the baseline model.

A tokenizer that uses a pre-trained model for word segmentation.

pythainlp.tokenize.oskut.segment(text: str, engine: str = 'ws') List[str][source]

newmm (Default)

Dictionary-based maximal matching word segmentation, constrained by Thai Character Cluster (TCC) boundaries with improved rules.

The code is based on the notebooks created by Korakot Chaovavanich, with a heuristic graph size limit added to avoid exponential waiting time.

The default word tokenization engine that provides a balance between accuracy and efficiency for most use cases.

pythainlp.tokenize.newmm.segment(text: str, custom_dict: ~pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, safe_mode: bool = False) List[str][source]

Maximal-matching word segmentation constrained by Thai Character Cluster.

A dictionary-based word segmentation using maximal matching algorithm, constrained by Thai Character Cluster boundaries.

A custom dictionary can be supplied.

Parameters:
  • text (str) – text to be tokenized

  • custom_dict (Trie, optional) – tokenization dictionary, defaults to DEFAULT_WORD_DICT_TRIE

  • safe_mode (bool, optional) – reduce chance for long processing time for long text with many ambiguous breaking points, defaults to False

Returns:

list of tokens

Return type:

List[str]
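One common way to read "maximal matching" is: among all segmentations, prefer the one with the fewest tokens. The sketch below implements that reading with dynamic programming over a plain set. It is a simplified illustration, not the newmm algorithm: it ignores TCC boundaries, has no graph size limit or safe mode, and the function name is hypothetical.

```python
from typing import Iterable, List


def fewest_tokens_segment(text: str, words: Iterable[str]) -> List[str]:
    """Pick a segmentation with the minimum number of tokens."""
    vocabulary = set(words)
    n = len(text)
    best = [0] * (n + 1)  # best[i]: fewest tokens that cover text[:i]
    back = [0] * (n + 1)  # back[i]: start index of the last token in that cover
    for end in range(1, n + 1):
        # Fallback: the single character text[end - 1] becomes its own token.
        best[end], back[end] = best[end - 1] + 1, end - 1
        for start in range(end - 1):
            if text[start:end] in vocabulary and best[start] + 1 < best[end]:
                best[end], back[end] = best[start] + 1, start
    tokens: List[str] = []
    position = n
    while position > 0:
        tokens.append(text[back[position]:position])
        position = back[position]
    return tokens[::-1]


print(fewest_tokens_segment("abcde", ["ab", "abc", "cde"]))
# ['ab', 'cde']
```

On the same input a greedy longest-match pass would take 'abc' first and be left with 'd' and 'e', so the example shows why comparing whole segmentations can beat a purely local choice.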

Subword level

tcc

The implementation of tokenizer according to Thai Character Clusters (TCCs) rules proposed by Theeramunkong et al. 2000.

Tokenizes text into Thai Character Clusters (TCCs), a subword level representation.

pythainlp.tokenize.tcc.tcc(text: str) str[source]

TCC generator which generates Thai Character Clusters

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

subwords (character clusters)

Return type:

Iterator[str]

pythainlp.tokenize.tcc.tcc_pos(text: str) Set[int][source]

TCC positions

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

set of the ending positions of subwords

Return type:

set[int]
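A set of cluster end positions is convenient because a word tokenizer can check in constant time whether a candidate break falls on a cluster boundary. The sketch below derives such a set from an already computed list of clusters and uses it as a filter. Both helper names are hypothetical, and the clusters are the ones given for ‘ขนมชั้น’ in the subword_tokenize description rather than the output of a call made here.

```python
from itertools import accumulate
from typing import Iterable, Set


def cluster_end_positions(clusters: Iterable[str]) -> Set[int]:
    """Running total of cluster lengths: the offset just after each cluster."""
    return set(accumulate(len(cluster) for cluster in clusters))


def is_allowed_break(position: int, ends: Set[int]) -> bool:
    """A break is allowed at the start of the text or right after a cluster."""
    return position == 0 or position in ends


ends = cluster_end_positions(["ข", "น", "ม", "ชั้น"])
print(sorted(ends))
# [1, 2, 3, 7]

print(is_allowed_break(3, ends), is_allowed_break(5, ends))
# True False
```

Position 5 falls inside ‘ชั้น’, so a dictionary match ending there would be rejected by an engine that is constrained by TCC boundaries.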

pythainlp.tokenize.tcc.segment(text: str) List[str][source]

Subword segmentation

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

list of subwords (character clusters), tokenized from the text

Return type:

list[str]

tcc+

The implementation of a tokenizer according to Thai Character Cluster (TCC) rules proposed by Theeramunkong et al. 2000, together with the improved rules that are used in newmm.

A subword tokenizer that includes additional rules for more precise subword segmentation.

pythainlp.tokenize.tcc_p.tcc(text: str) str[source]

TCC generator which generates Thai Character Clusters

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

subwords (character clusters)

Return type:

Iterator[str]

pythainlp.tokenize.tcc_p.tcc_pos(text: str) Set[int][source]

TCC positions

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

set of the ending positions of subwords

Return type:

set[int]

pythainlp.tokenize.tcc_p.segment(text: str) List[str][source]

Subword segmentation

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

list of subwords (character clusters), tokenized from the text

Return type:

list[str]

etcc

Segmenting text into Enhanced Thai Character Clusters (ETCCs). Python implementation by Wannaphong Phatthiyaphaibun.

This implementation relies on a dictionary of ETCC created from etcc.txt in pythainlp/corpus.

Notebook: https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ

See Also:

Jeeragone Inrut, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

Enhanced Thai Character Clusters (ETCC) tokenizer for subword-level analysis.

pythainlp.tokenize.etcc.segment(text: str) List[str][source]

Segmenting text into ETCCs.

Enhanced Thai Character Cluster (ETCC) is a kind of subword unit. The concept was presented in Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

Parameters:

text (str) – text to be tokenized into character clusters

Returns:

list of clusters, tokenized from the text

Return type:

List[str]

han_solo

🪿 Han-solo: Thai syllable segmenter

GitHub: https://github.com/PyThaiNLP/Han-solo

A CRF syllable segmenter for Thai that can work in the Thai social media domain.

class pythainlp.tokenize.han_solo.Featurizer(N=2, sequence_size=1, delimiter=None)[source]
__init__(N=2, sequence_size=1, delimiter=None)[source]
pad(sentence, padder='#')[source]
featurize(sentence, padding=True, indiv_char=True, return_type='list')[source]
pythainlp.tokenize.han_solo.segment(text: str) List[str][source]