.. currentmodule:: pythainlp.tokenize

.. _tokenize-doc:

pythainlp.tokenize
==================

The :mod:`pythainlp.tokenize` module contains a comprehensive set of functions and classes for tokenizing Thai text into various units, such as sentences, words, subwords, and more. This module is a fundamental component of the PyThaiNLP library, providing tools for natural language processing in the Thai language.

Modules
-------

.. autofunction:: clause_tokenize
   :noindex:

Tokenizes text into clauses. This function allows you to split text into meaningful sections, making it useful for more advanced text processing tasks.

.. autofunction:: sent_tokenize
   :noindex:

Splits Thai text into sentences. This function identifies sentence boundaries, which is essential for text segmentation and analysis.

.. autofunction:: paragraph_tokenize
   :noindex:

Segments text into paragraphs, which can be valuable for document-level analysis or summarization.

.. autofunction:: subword_tokenize
   :noindex:

Tokenizes text into subwords, which can be helpful for various NLP tasks, including subword embeddings.

.. autofunction:: syllable_tokenize
   :noindex:

Divides text into syllables, allowing you to work with individual Thai phonetic units.

.. autofunction:: word_tokenize
   :noindex:

Splits text into words. This function is a fundamental tool for Thai text analysis.

.. autofunction:: word_detokenize
   :noindex:

Reverses the tokenization process, reconstructing text from tokenized units. Useful for text generation tasks.

.. autoclass:: Tokenizer
   :members:

The `Tokenizer` class is a versatile tool for customizing the tokenization process and managing tokenization models. It provides methods and attributes to fine-tune tokenization to your needs, such as supplying a custom dictionary or selecting a different engine.
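A minimal usage sketch of the functions and the `Tokenizer` class documented above. The sample sentence and the extra dictionary entry are illustrative only; the exact tokens returned depend on the engine and dictionary in use.

.. code-block:: python

   from pythainlp.corpus import thai_words
   from pythainlp.tokenize import (
       Tokenizer,
       sent_tokenize,
       subword_tokenize,
       word_tokenize,
   )

   text = "ผมรักภาษาไทยเพราะผมเป็นคนไทย"  # illustrative sample sentence

   # Word segmentation with the default engine (newmm)
   words = word_tokenize(text, engine="newmm", keep_whitespace=False)

   # Sentence segmentation with the CRF-based engine
   sentences = sent_tokenize(text, engine="crfcut")

   # Subword segmentation into Thai Character Clusters
   subwords = subword_tokenize(text, engine="tcc")

   # A Tokenizer with a custom dictionary; the custom dictionary replaces
   # the default word list, so extend the stock vocabulary rather than
   # starting from an empty set
   custom_dict = set(thai_words()) | {"ภาษาไทย"}
   tokenizer = Tokenizer(custom_dict=custom_dict, engine="newmm")
   custom_tokens = tokenizer.word_tokenize(text)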
Tokenization Engines
--------------------

This module offers multiple tokenization engines designed for different levels of text analysis.

Sentence level
--------------

**crfcut**

.. automodule:: pythainlp.tokenize.crfcut
   :members:

A sentence-level tokenizer that uses Conditional Random Fields (CRF) to segment text into sentences accurately.

**thaisumcut**

.. automodule:: pythainlp.tokenize.thaisumcut
   :members:

A sentence tokenizer from the ThaiSum text-summarization project, providing an alternative approach to sentence boundary detection in Thai text.

Word level
----------

**attacut**

.. automodule:: pythainlp.tokenize.attacut
   :members:

A fast, neural word tokenizer (AttaCut) that provides reasonably accurate word boundary detection in Thai text.

**deepcut**

.. automodule:: pythainlp.tokenize.deepcut
   :members:

Uses a deep-learning model for word segmentation, favoring accuracy over speed.

**multi_cut**

.. automodule:: pythainlp.tokenize.multi_cut
   :members:

A dictionary-based maximal-matching tokenizer that can also enumerate multiple possible segmentations of the same text.

**nlpo3**

.. automodule:: pythainlp.tokenize.nlpo3
   :members:

A word tokenizer backed by nlpO3, a Rust implementation of dictionary-based Thai word segmentation, offering fast word boundary detection.

**longest**

.. automodule:: pythainlp.tokenize.longest
   :members:

A dictionary-based tokenizer that identifies word boundaries by selecting the longest possible match at each position.

**pyicu**

.. automodule:: pythainlp.tokenize.pyicu
   :members:

An ICU-based word tokenizer offering robust support for Thai text segmentation.

**nercut**

.. automodule:: pythainlp.tokenize.nercut
   :members:

A dictionary-based tokenizer that merges adjacent tokens belonging to the same named entity, keeping entity mentions intact for Named Entity Recognition (NER) work.

**sefr_cut**

.. automodule:: pythainlp.tokenize.sefr_cut
   :members:

A wrapper for SEFR Cut, a stacked-ensemble word tokenizer for Thai text with a focus on precision.

**oskut**

.. automodule:: pythainlp.tokenize.oskut
   :members:

A wrapper for OSKut, a word tokenizer that uses a pre-trained model and is designed to stay accurate on out-of-domain text.

**newmm (Default)**

.. automodule:: pythainlp.tokenize.newmm
   :members:

The default word tokenization engine: dictionary-based maximal matching constrained by Thai Character Clusters, balancing accuracy and efficiency for most use cases.

Subword level
-------------

**tcc**

.. automodule:: pythainlp.tokenize.tcc
   :members:

Tokenizes text into Thai Character Clusters (TCCs), a subword-level representation.

**tcc+**

.. automodule:: pythainlp.tokenize.tcc_p
   :members:

A subword tokenizer that extends TCC with additional rules for more precise subword segmentation.

**etcc**

.. automodule:: pythainlp.tokenize.etcc
   :members:

Enhanced Thai Character Cluster (eTCC) tokenizer for subword-level analysis.

**han_solo**

.. automodule:: pythainlp.tokenize.han_solo
   :members:

A Thai syllable segmenter (Han-solo), usable for subword-level analysis.
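The engines listed above are selected through the ``engine`` argument of the high-level functions. A minimal sketch, assuming the bundled engines (``newmm``, ``longest``, ``tcc``, ``etcc``) are used; engines such as ``attacut`` or ``deepcut`` additionally require their optional packages to be installed. The sample sentence is illustrative.

.. code-block:: python

   from pythainlp.tokenize import subword_tokenize, word_detokenize, word_tokenize

   text = "ผมรักภาษาไทย"  # illustrative sample sentence

   # Dictionary-based word tokenizers bundled with PyThaiNLP
   newmm_tokens = word_tokenize(text, engine="newmm")
   longest_tokens = word_tokenize(text, engine="longest")

   # Neural engines need their optional packages, e.g. ``pip install attacut``
   # attacut_tokens = word_tokenize(text, engine="attacut")

   # Subword-level segmentation with different engines
   tcc_tokens = subword_tokenize(text, engine="tcc")
   etcc_tokens = subword_tokenize(text, engine="etcc")

   # Reassemble a token list back into running text
   reconstructed = word_detokenize(newmm_tokens)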