nlpO3 is a Thai natural language processing library written in Rust, with Python and Node bindings. Like newmm, it provides a dictionary-based maximal-matching tokenizer that honors Thai character cluster boundaries. Unlike newmm, which is implemented in pure Python, nlpO3 is much faster. For a comparison, refer to Benchmark nlpo3.segment. Learn more about nlpO3 here.
In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary.
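To make the idea of dictionary-based matching concrete, here is a minimal pure-Python sketch. It is a simplified greedy longest-match, not nlpO3's or newmm's actual algorithm: it ignores Thai character cluster boundaries and does no backtracking. The function name `longest_match_segment` and the small word set are illustrative assumptions.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation (simplified illustration only)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    unknown = ""  # characters not covered by any dictionary word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match:
            if unknown:
                tokens.append(unknown)  # flush the unknown run before the match
                unknown = ""
            tokens.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append(unknown)
    return tokens


# Hypothetical toy dictionary, for illustration only.
words = {"ทดสอบ", "ตัด", "คำ", "ภาษา", "ไทย", "ภาษาไทย"}
print(longest_match_segment("ทดสอบตัดคำภาษาไทย", words))
```

Text that no dictionary entry covers is kept together as a single unknown token, which is why the choice of dictionary strongly affects the result.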
We install the Python binding using pip.
!pip install nlpo3
Collecting nlpo3
Successfully installed nlpo3-1.1.2
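Since the API may differ between releases, it can help to confirm which version was installed. The sketch below uses only the standard library; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed nlpo3 version, or None if the package is not installed.
print(installed_version("nlpo3"))
```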
First we try segmenting a Thai sentence into a list of words without specifying a dictionary parameter.
from nlpo3 import segment
segment("ทดสอบตัดคำภาษาไทย")
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']
Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.
We use the wget command to download the list from GitHub. It’s a plain text file containing one word per line.
!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
--2021-06-22 05:14:58-- https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
Resolving github.com (github.com)... 184.108.40.206
Connecting to github.com (github.com)|220.127.116.11|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]
--2021-06-22 05:14:58-- https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 18.104.22.168, 22.214.171.124, 126.96.36.199, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|188.8.131.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’
countries_th.txt 100%[===================>] 7.44K --.-KB/s in 0s
2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]
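Before loading the file as a dictionary, it can be useful to check that it really holds one word per line. The following is a minimal sketch using only the standard library; `read_word_list` is a hypothetical helper, and UTF-8 encoding is assumed.

```python
from pathlib import Path


def read_word_list(path):
    """Read a one-word-per-line text file and return the non-empty, stripped lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Preview the downloaded list only if the file is present in the working directory.
if Path("countries_th.txt").exists():
    countries = read_word_list("countries_th.txt")
    print(len(countries), countries[:5])
```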
We use the load_dict function to load the contents of the downloaded file into the countries dictionary.
from nlpo3 import segment, load_dict
load_dict("countries_th.txt", "countries")
Successful: dictionary name countries from file countries_th.txt has been successfully loaded
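Because load_dict reads its words from a file, a custom word list held in Python has to be written to disk first. Below is a minimal sketch, assuming the same one-word-per-line UTF-8 format as countries_th.txt; the helper `write_word_list` and the file name `my_words.txt` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_word_list(words, path):
    """Write words one per line, dropping blanks and duplicates; return the count written."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


# Demonstration in a temporary directory, so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "my_words.txt"
    count = write_word_list(["ไทย", "เกาหลี", "ไทย"], target)
    print(count, target.read_text(encoding="utf-8").splitlines())
```

A file produced this way could then be loaded with load_dict under a dictionary name of your choice, in the same way as countries_th.txt.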
Finally, we call the segment function on a Thai sentence, specifying the countries dictionary as a parameter.
segment("สวัสดีครับประเทศไทย เกาหลี", "countries")
['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']