nlpO3
Thai natural language processing in Rust, with Python bindings.
Features
Fast newmm dictionary-based word tokenization
Support for custom dictionaries
Install
[1]:
!pip install nlpo3
Collecting nlpo3
Successfully installed nlpo3-1.1.2
Usage
PyThaiNLP dictionary
[2]:
from nlpo3 import segment
[3]:
segment("ทดสอบตัดคำภาษาไทย")
[3]:
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']
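Because segment returns a plain Python list of strings, the result can be post-processed with ordinary Python. The sketch below is only an illustration: it does not call nlpo3, it reuses the token list copied from the output above, and the helper name token_spans is hypothetical. It computes the start and end offset of each token, counted in Unicode code points, and checks that the tokens join back into the input text.

```python
# Token list copied from the output above; nlpo3 is not called here.
text = "ทดสอบตัดคำภาษาไทย"
tokens = ["ทดสอบ", "ตัด", "คำ", "ภาษาไทย"]


def token_spans(tokens):
    """Return (token, start, end) triples; offsets are code point indices."""
    spans = []
    start = 0
    for tok in tokens:
        end = start + len(tok)
        spans.append((tok, start, end))
        start = end
    return spans


spans = token_spans(tokens)

# True when the tokens cover the input with no gaps or overlaps.
print("".join(tokens) == text)

for tok, start, end in spans:
    print(tok, start, end)
```

Offsets like these are useful when token boundaries need to be mapped back onto the original string, for example to highlight words.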
Custom dictionary
Here we use a dictionary of Thai country names to segment text.
[4]:
!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
--2021-06-22 05:14:58-- https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]
--2021-06-22 05:14:58-- https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’
countries_th.txt 100%[===================>] 7.44K --.-KB/s in 0s
2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]
[5]:
from nlpo3 import segment, load_dict
[9]:
load_dict("countries_th.txt", "countries")
Successful: dictionary name countries from file countries_th.txt has been successfully loaded
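A dictionary file does not have to be downloaded; it can also be written locally. The sketch below assumes the same plain-text layout as countries_th.txt, that is, UTF-8 with one entry per line. The helper write_dictionary, the file name my_words.txt, and the dictionary name my_words are hypothetical, and the load_dict call is left as a comment so the snippet runs without nlpo3 installed.

```python
import tempfile
from pathlib import Path


def write_dictionary(words, path):
    """Write unique, non-empty words to path, one per line, as UTF-8."""
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


# Hypothetical file name; duplicates and blank entries are dropped.
dict_path = Path(tempfile.gettempdir()) / "my_words.txt"
count = write_dictionary(["ไทย", "เกาหลี", "ไทย", " "], dict_path)
print(count, "entries written to", dict_path)

# With nlpo3 installed, the file could then be loaded as in the cell above:
# load_dict(str(dict_path), "my_words")
```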
[11]:
segment("สวัสดีครับประเทศไทย เกาหลี", "countries")
[11]:
['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
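Note that สวัสดีครับประเทศ comes back as a single token: the countries dictionary contains only country names, so text that matches no entry is not split. The sketch below illustrates this with a simplified greedy longest-match tokenizer in pure Python. It is not the newmm algorithm and not how nlpo3 is implemented; the function name longest_match_segment and the two-word dictionary are assumptions made for illustration.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match tokenizer (simplified sketch, not newmm).

    Characters that start no dictionary word are collected into a single
    "unknown" token until the next dictionary match is found.
    """
    # Try longer words first so the longest match wins; skip empty entries.
    words = sorted((w for w in dictionary if w), key=len, reverse=True)
    tokens = []
    unknown = []
    i = 0
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is not None:
            if unknown:
                tokens.append("".join(unknown))
                unknown = []
            tokens.append(match)
            i += len(match)
        else:
            unknown.append(text[i])
            i += 1
    if unknown:
        tokens.append("".join(unknown))
    return tokens


# A tiny stand-in for the countries dictionary (assumed entries).
countries = {"ไทย", "เกาหลี"}
print(longest_match_segment("สวัสดีครับประเทศไทย เกาหลี", countries))
```

In practice, a custom dictionary intended for general text should also contain the common words you expect to be split out, not just the domain terms.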
For word segmentation speed benchmarks, see https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb