

nlpO3 is a Rust natural language processing library for Thai with Python and Node bindings. Like newmm, it provides a dictionary-based maximal-matching tokenizer that honors Thai character cluster boundaries. However, nlpO3 is much faster than newmm, which is a pure Python implementation. For a comparison, refer to Benchmark nlpo3.segment. Learn more about nlpO3 here.

In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary.
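To give a feel for dictionary-based maximal matching, here is a deliberately simplified sketch in plain Python. It is not the algorithm nlpO3 or newmm actually implements: it greedily takes the longest dictionary word at each position and ignores Thai character cluster boundaries. The function name `longest_match_tokenize` and the small word set are assumptions made for illustration only.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation (simplified illustration only)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    unknown = []  # characters not covered by any dictionary word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:
            # No dictionary word starts here; buffer the character.
            unknown.append(text[i])
            i += 1
        else:
            # Flush buffered unknown characters as one token before the match.
            if unknown:
                tokens.append("".join(unknown))
                unknown = []
            tokens.append(match)
            i += len(match)
    if unknown:
        tokens.append("".join(unknown))
    return tokens


# Hypothetical toy dictionary, chosen only for this illustration.
toy_words = {"ทดสอบ", "ตัด", "คำ", "ภาษา", "ไทย", "ภาษาไทย"}
print(longest_match_tokenize("ทดสอบตัดคำภาษาไทย", toy_words))
```

Note how the longer entry ภาษาไทย wins over the shorter ภาษา, and how text that matches no entry is grouped into a single token rather than dropped.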


We install the Python binding using pip.

!pip install nlpo3
Collecting nlpo3
Successfully installed nlpo3-1.1.2
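
If you want to confirm from Python which version was installed, one option is the standard library's `importlib.metadata` (Python 3.8 or later). The helper name `installed_version` is an assumption for this sketch, not part of nlpO3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed nlpo3 version, or None if it is not installed.
print(installed_version("nlpo3"))
```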

PyThaiNLP dictionary

First, we segment a Thai sentence into a list of words without specifying a dictionary parameter.

from nlpo3 import segment
segment("ทดสอบตัดคำภาษาไทย")
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']

Custom dictionary

Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.

We use the wget command to download the list from GitHub. It’s a plain text file containing one word per line.
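
As a quick illustration of the one-word-per-line format, the sketch below writes a tiny sample file and reads it back with the standard library. The file name `sample_words_th.txt`, its three entries, and the helper `read_word_list` are assumptions for this example; they are not the contents of the downloaded list.

```python
import tempfile
from pathlib import Path


def read_word_list(path):
    """Return the non-empty, stripped lines of a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file; the blank line is skipped when reading.
    sample_path = Path(tmp_dir) / "sample_words_th.txt"
    sample_path.write_text("ไทย\nเกาหลี\n\nญี่ปุ่น\n", encoding="utf-8")
    words = read_word_list(sample_path)

print(len(words), words)
```

Inspecting a word list this way before loading it can help catch encoding problems or stray blank lines.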

HTTP request sent, awaiting response... 302 Found
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’

countries_th.txt    100%[===================>]   7.44K  --.-KB/s    in 0s

2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]

We use the load_dict function to load the contents of the downloaded file and register them as a dictionary named countries.

from nlpo3 import segment, load_dict
load_dict("countries_th.txt", "countries")
Successful: dictionary name countries from file countries_th.txt has been successfully loaded

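Finally, we call the segment function on a Thai sentence, passing countries as the dictionary name. Because this dictionary contains only country names, text that matches no entry, such as สวัสดีครับประเทศ, comes back as a single token.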

segment("สวัสดีครับประเทศไทย เกาหลี", "countries")
['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
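
The result above keeps the space between the two parts of the sentence as its own token. If downstream code only needs the words, a small post-processing step can drop whitespace-only tokens. This is plain Python applied to the list shown above, not an nlpO3 feature, and the helper name `drop_whitespace_tokens` is an assumption.

```python
def drop_whitespace_tokens(tokens):
    """Remove tokens that are empty or consist only of whitespace."""
    return [token for token in tokens if token.strip()]


# Token list copied from the segment output above.
tokens = ['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
print(drop_whitespace_tokens(tokens))
```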