
nlpO3

nlpO3 is a Thai natural language processing library written in Rust, with Python and Node bindings. Like newmm, it provides a dictionary-based maximal-matching tokenizer that honors Thai character cluster boundaries. Unlike newmm, which is implemented in pure Python, nlpO3 is much faster. For a comparison, refer to Benchmark nlpo3.segment. Learn more about nlpO3 here.

In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary.

Installation

We install the Python binding using pip.

[1]:
!pip install nlpo3
Collecting nlpo3
Successfully installed nlpo3-1.1.2
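
The outputs in this tutorial were recorded with nlpo3 1.1.2, and other releases may behave differently. As a minimal sketch, the installed version can be checked from Python with the standard library. The helper name installed_version is hypothetical and not part of nlpo3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the pip install above succeeded, otherwise None.
print(installed_version("nlpo3"))
```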

PyThaiNLP dictionary

First, we segment a Thai sentence into a list of words without specifying a dictionary, so nlpO3 falls back to its default PyThaiNLP dictionary.

[2]:
from nlpo3 import segment
[3]:
segment("ทดสอบตัดคำภาษาไทย")
[3]:
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']

Custom dictionary

Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.

We use the wget command to download the list from GitHub. It’s a plain text file containing one word per line.

[4]:
!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
--2021-06-22 05:14:58--  https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]
--2021-06-22 05:14:58--  https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’

countries_th.txt    100%[===================>]   7.44K  --.-KB/s    in 0s

2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]
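
A word list in the same one-word-per-line format can also be written by hand. The sketch below assumes UTF-8 encoding. The file name my_words_th.txt and both helper functions are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def write_word_list(path, words):
    """Write words to `path`, one per line, skipping blanks and duplicates."""
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in cleaned:
            cleaned.append(word)
    # Assumption: the file is UTF-8 encoded text with one word per line.
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_word_list(path):
    """Read the file back as a list of non-empty, stripped lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# The last entry is a duplicate with stray whitespace and is dropped.
sample_words = ["ไทย", "เกาหลี", "ญี่ปุ่น", " ไทย "]
write_word_list("my_words_th.txt", sample_words)
print(read_word_list("my_words_th.txt"))
```

A file produced this way has the same layout as the downloaded countries_th.txt, so it could serve as a custom dictionary in the same manner.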

We use the load_dict function to load the contents of the downloaded file and register them under the dictionary name countries.

[5]:
from nlpo3 import segment, load_dict
[6]:
load_dict("countries_th.txt", "countries")
Successful: dictionary name countries from file countries_th.txt has been successfully loaded

Finally, we call the segment function on a Thai sentence, passing countries as the dictionary name.

[7]:
segment("สวัสดีครับประเทศไทย เกาหลี", "countries")
[7]:
['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
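
In the result above, everything before the first country name comes back as a single token. One plausible reading is that the countries dictionary contains only country names, so the preceding text is unknown to the tokenizer. The toy sketch below illustrates this with a greedy longest-match tokenizer in pure Python. It is not nlpO3's implementation: it simplifies maximal matching and ignores Thai character cluster boundaries. The function name and the two-word dictionary are assumptions made for illustration.

```python
def longest_match_tokenize(text, words):
    """Greedy longest-match tokenizer; unknown characters are grouped into runs."""
    # Try longer dictionary words first so the longest match wins.
    by_length = sorted((w for w in words if w), key=len, reverse=True)
    tokens = []
    unknown = []  # buffer for characters not covered by any dictionary word
    i = 0
    while i < len(text):
        match = next((w for w in by_length if text.startswith(w, i)), None)
        if match is None:
            unknown.append(text[i])
            i += 1
            continue
        if unknown:
            # Flush the pending unknown run before emitting the matched word.
            tokens.append("".join(unknown))
            unknown = []
        tokens.append(match)
        i += len(match)
    if unknown:
        tokens.append("".join(unknown))
    return tokens


# Hypothetical two-word dictionary standing in for the country list.
toy_countries = {"ไทย", "เกาหลี"}
print(longest_match_tokenize("สวัสดีครับประเทศไทย เกาหลี", toy_countries))
# → ['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
```

With a dictionary that also covers general vocabulary, the leading text would be split further, as in the first example of this tutorial.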