nlpO3

nlpO3 is a natural language processing library for Thai, written in Rust, with Python and Node bindings. Like newmm, it provides a dictionary-based maximal-matching tokenizer that honors Thai character cluster boundaries. Unlike newmm, which is a pure Python implementation, nlpO3 is much faster. For a comparison, refer to Benchmark nlpo3.segment. Learn more about nlpO3 here.

In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary.

Installation

We install the Python binding using pip.

[1]:

!pip install nlpo3

Collecting nlpo3
Successfully installed nlpo3-1.1.2

PyThaiNLP dictionary

First, we segment a Thai sentence into a list of words without specifying a dictionary parameter.

[2]:

from nlpo3 import segment

[3]:

segment("ทดสอบตัดคำภาษาไทย")

[3]:

['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']
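To give a feel for what dictionary-based matching means, here is a minimal pure-Python sketch of greedy longest matching. It is an illustration only and not nlpO3's implementation: it ignores Thai character cluster boundaries, and the function name longest_match_tokenize and the tiny word set are hypothetical.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation (simplified illustration only)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    unknown = ""  # buffer for characters not covered by any dictionary word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate starting at position i first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # No dictionary word starts here: collect the character as unknown.
            unknown += text[i]
            i += 1
        else:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            i += len(match)
    if unknown:
        tokens.append(unknown)
    return tokens


# A tiny, hand-picked word set (an assumption made for this example).
words = {"ทดสอบ", "ตัด", "คำ", "ภาษา", "ไทย", "ภาษาไทย"}
print(longest_match_tokenize("ทดสอบตัดคำภาษาไทย", words))
```

Text that matches no dictionary entry is kept together as a single unknown token in this sketch, which is why the choice of dictionary has such a large effect on the result.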

Custom dictionary

Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.

We use the wget command to download the list from GitHub. It’s a plain text file containing one word per line.

[4]:

!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt

--2021-06-22 05:14:58--  https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]
--2021-06-22 05:14:58--  https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’

countries_th.txt    100%[===================>]   7.44K  --.-KB/s    in 0s

2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]
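Before loading the file, it can help to check that it really contains one word per line. The following is a small sketch using only the Python standard library; the helper name read_word_list is hypothetical and not part of nlpO3.

```python
from pathlib import Path


def read_word_list(path):
    """Return the non-empty, stripped lines of a one-word-per-line text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Only inspect the file if it has been downloaded to the working directory.
if Path("countries_th.txt").exists():
    countries = read_word_list("countries_th.txt")
    print(len(countries), "entries; first five:", countries[:5])
```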

We use the load_dict function to load the contents of the downloaded file and register them under the dictionary name countries.

[5]:

from nlpo3 import segment, load_dict

[6]:

load_dict("countries_th.txt", "countries")

Successful: dictionary name countries from file countries_th.txt has been successfully loaded

Finally, we call the segment function on a Thai sentence, passing the name of the countries dictionary as the second argument.

[7]:

segment("สวัสดีครับประเทศไทย เกาหลี", "countries")

[7]:

['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
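In the output above, the text before ไทย stays together as one token, presumably because the countries dictionary contains only country names. One possible way to improve this is to write a combined word list file and load it with load_dict in the same way as before. The sketch below covers only the file-merging step with the standard library; the helper name write_merged_dict, the extra words, and the file name countries_plus_th.txt are assumptions made for illustration.

```python
from pathlib import Path


def write_merged_dict(source_path, extra_words, target_path):
    """Write source words plus extra words, one per line, without duplicates."""
    source_words = Path(source_path).read_text(encoding="utf-8").splitlines()
    merged = []
    seen = set()
    for word in list(source_words) + list(extra_words):
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            merged.append(word)
    Path(target_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)


# Hypothetical extra words; pick the vocabulary your own texts need.
extra = ["สวัสดี", "ครับ", "ประเทศ"]

if Path("countries_th.txt").exists():
    count = write_merged_dict("countries_th.txt", extra, "countries_plus_th.txt")
    print(count, "words written to countries_plus_th.txt")
    # The merged file could then be loaded as shown earlier, for example:
    # load_dict("countries_plus_th.txt", "countries_plus")
```

The order of the original list is preserved and the extra words are appended, so the merged file remains a plain one-word-per-line text file.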