Interactive online version:

PyThaiNLP Get Started

Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp

[1]:

# # pip install required modules
# # uncomment if running from colab
# # see list of modules in `requirements` and `extras`
# # in https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py

#!pip install pythainlp
#!pip install epitran

Import PyThaiNLP

[2]:

import pythainlp

pythainlp.__version__

[2]:

'2.2.1'

Thai Characters

PyThaiNLP provides some ready-to-use Thai character set (e.g. Thai consonants, vowels, tonemarks, symbols) as a string for convenience. There are also few utility functions to test if a string is in Thai or not.

[3]:

pythainlp.thai_characters

[3]:

'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮฤฦะัาำิีึืุูเแโใไๅํ็่้๊๋ฯฺๆ์ํ๎๏๚๛๐๑๒๓๔๕๖๗๘๙฿'

[4]:

len(pythainlp.thai_characters)

[4]:

[5]:

pythainlp.thai_consonants

[5]:

'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ'

[6]:

len(pythainlp.thai_consonants)

[6]:

[7]:

"๔" in pythainlp.thai_digits  # check if Thai digit "4" is in the character set

[7]:

True

Checking if a string contains Thai character or not, or how many

[8]:

import pythainlp.util

pythainlp.util.isthai("ก")

[8]:

True

[9]:

pythainlp.util.isthai("(ก.พ.)")

[9]:

False

[10]:

pythainlp.util.isthai("(ก.พ.)", ignore_chars=".()")

[10]:

True

counthai() returns proportion of Thai characters in the text. It will ignore non-alphabets by default.

[11]:

pythainlp.util.countthai("วันอาทิตย์ที่ 24 มีนาคม 2562")

[11]:

100.0

You can specify characters to be ignored, using ignore_chars= parameter.

[12]:

pythainlp.util.countthai("วันอาทิตย์ที่ 24 มีนาคม 2562", ignore_chars="")

[12]:

67.85714285714286

Collation

Sorting according to Thai dictionary.

[13]:

from pythainlp.util import collate

thai_words = ["ค้อน", "กระดาษ", "กรรไกร", "ไข่", "ผ้าไหม"]
collate(thai_words)

[13]:

['กรรไกร', 'กระดาษ', 'ไข่', 'ค้อน', 'ผ้าไหม']

[14]:

collate(thai_words, reverse=True)

[14]:

['ผ้าไหม', 'ค้อน', 'ไข่', 'กระดาษ', 'กรรไกร']

Date/Time Format and Spellout

Date/Time Format

Get Thai day and month names with Thai Buddhist Era (B.E.). Use formatting directives similar to datetime.strftime().

[15]:

import datetime
from pythainlp.util import thai_strftime

fmt = "%Aที่ %-d %B พ.ศ. %Y เวลา %H:%M น. (%a %d-%b-%y)"
date = datetime.datetime(1976, 10, 6, 1, 40)

thai_strftime(date, fmt)

[15]:

'วันพุธที่ 6 ตุลาคม พ.ศ. 2519 เวลา 01:40 น. (พ 06-ต.ค.-19)'

From version 2.2, these modifiers can be applied right before the main directive:

- (minus) Do not pad a numeric result string (also available in version 2.1)
_ (underscore) Pad a numeric result string with spaces
0 (zero) Pad a number result string with zeros
^ Convert alphabetic characters in result string to upper case
# Swap the case of the result string
O (letter o) Use the locale’s alternative numeric symbols (Thai digit)

[16]:

thai_strftime(date, "%d %b %y")

[16]:

'06 ต.ค. 19'

[17]:

thai_strftime(date, "%d %b %Y")

[17]:

'06 ต.ค. 2519'

Time Spellout

Note: ``thai_time()`` will be renamed to ``time_to_thaiword()`` in version 2.2.

[18]:

from pythainlp.util import thai_time

thai_time("00:14:29")

[18]:

'ศูนย์นาฬิกาสิบสี่นาทียี่สิบเก้าวินาที'

The way to spellout can be chosen, using fmt parameter. It can be 24h, 6h, or m6h. Try one by yourself.

[19]:

thai_time("00:14:29", fmt="6h")

[19]:

'เที่ยงคืนสิบสี่นาทียี่สิบเก้าวินาที'

Precision of spellout can be chosen as well. Using precision parameter. It can be m for minute-level, s for second-level, or None for only read the non-zero value.

[20]:

thai_time("00:14:29", precision="m")

[20]:

'ศูนย์นาฬิกาสิบสี่นาที'

[21]:

print(thai_time("8:17:00", fmt="6h"))
print(thai_time("8:17:00", fmt="m6h", precision="s"))
print(thai_time("18:30:01", fmt="m6h", precision="m"))
print(thai_time("13:30:01", fmt="6h", precision="m"))

สองโมงเช้าสิบเจ็ดนาที
แปดโมงสิบเจ็ดนาทีศูนย์วินาที
หกโมงครึ่ง
บ่ายโมงครึ่ง

We can also pass datetime and time objects to thai_time().

[22]:

import datetime

time = datetime.time(13, 14, 15)
thai_time(time)

[22]:

'สิบสามนาฬิกาสิบสี่นาทีสิบห้าวินาที'

[23]:

time = datetime.datetime(10, 11, 12, 13, 14, 15)
thai_time(time, fmt="6h", precision="m")

[23]:

'บ่ายโมงสิบสี่นาที'

Tokenization and Segmentation

At sentence, word, and sub-word levels.

Sentence

Default sentence tokenizer is “crfcut”. Tokenization engine can be chosen ussing engine= parameter.

[24]:

from pythainlp import sent_tokenize

text = ("พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ "
        "เป็นรัฐธรรมนูญฉบับชั่วคราว ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม "
        "ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 "
        "โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร")

print("default (crfcut):")
print(sent_tokenize(text))
print("\nwhitespace+newline:")
print(sent_tokenize(text, engine="whitespace+newline"))

default (crfcut):
['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ เป็นรัฐธรรมนูญฉบับชั่วคราว ', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม ', 'ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 ', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร']

whitespace+newline:
['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว', 'พุทธศักราช', '๒๔๗๕', 'เป็นรัฐธรรมนูญฉบับชั่วคราว', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม', 'ประกาศใช้เมื่อวันที่', '27', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่', '24', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยคณะราษฎร']

Word

Default word tokenizer (“newmm”) use maximum matching algorithm.

[25]:

from pythainlp import word_tokenize

text = "ก็จะรู้ความชั่วร้ายที่ทำไว้     และคงจะไม่ยอมให้ทำนาบนหลังคน "

print("default (newmm):")
print(word_tokenize(text))
print("\nnewmm and keep_whitespace=False:")
print(word_tokenize(text, keep_whitespace=False))

default (newmm):
['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', '     ', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน', ' ']

newmm and keep_whitespace=False:
['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน']

Other algorithm can be chosen. We can also create a tokenizer with a custom dictionary.

[3]:

from pythainlp import word_tokenize, Tokenizer

text = "กฎหมายแรงงานฉบับปรับปรุงใหม่ประกาศใช้แล้ว"

print("newmm  :", word_tokenize(text))  # default engine is "newmm"
print("longest:", word_tokenize(text, engine="longest"))

words = ["แรงงาน"]
custom_tokenizer = Tokenizer(words)
print("newmm (custom dictionary):", custom_tokenizer.word_tokenize(text))

newmm  : ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศ', 'ใช้แล้ว']
longest: ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศใช้', 'แล้ว']
newmm (custom dictionary): ['กฎหมาย', 'แรงงาน', 'ฉบับปรับปรุงใหม่ประกาศใช้แล้ว']

Default word tokenizer use a word list from pythainlp.corpus.common.thai_words(). We can get that list, add/remove words, and create new tokenizer from the modified list.

[4]:

from pythainlp.corpus.common import thai_words
from pythainlp import Tokenizer

text = "นิยายวิทยาศาสตร์ของไอแซค อสิมอฟ"

print("default dictionary:", word_tokenize(text))

words = set(thai_words())  # thai_words() returns frozenset
words.add("ไอแซค")  # Isaac
words.add("อสิมอฟ")  # Asimov
custom_tokenizer = Tokenizer(words)
print("custom dictionary :", custom_tokenizer.word_tokenize(text))

default dictionary: ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิ', 'มอ', 'ฟ']
custom dictionary : ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิมอฟ']

We can also, alternatively, create a dictionary trie, using pythainlp.util.Trie() function, and pass it to a default tokenizer.

[5]:

from pythainlp.corpus.common import thai_words
from pythainlp.util import Trie

text = "ILO87 ว่าด้วยเสรีภาพในการสมาคมและการคุ้มครองสิทธิในการรวมตัว ILO98 ว่าด้วยสิทธิในการรวมตัวและการร่วมเจรจาต่อรอง"

print("default dictionary:", word_tokenize(text))

new_words = {"ILO87", "ILO98", "การร่วมเจรจาต่อรอง", "สิทธิในการรวมตัว", "เสรีภาพในการสมาคม", "แรงงานสัมพันธ์"}
words = new_words.union(thai_words())

custom_dictionary_trie = Trie(words)
print("custom dictionary :", word_tokenize(text, custom_dict=custom_dictionary_trie))

default dictionary: ['ILO', '87', ' ', 'ว่าด้วย', 'เสรีภาพ', 'ใน', 'การสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', ' ', 'ILO', '98', ' ', 'ว่าด้วย', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', 'และ', 'การ', 'ร่วม', 'เจรจา', 'ต่อรอง']
custom dictionary : ['ILO87', ' ', 'ว่าด้วย', 'เสรีภาพในการสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิในการรวมตัว', ' ', 'ILO98', ' ', 'ว่าด้วย', 'สิทธิในการรวมตัว', 'และ', 'การร่วมเจรจาต่อรอง']

Testing different tokenization engines

[29]:

speedtest_text = """
ครบรอบ 14 ปี ตากใบ เช้าวันนั้น 25 ต.ค. 2547 ผู้ชุมนุมชายกว่า 1,370 คน
ถูกโยนขึ้นรถยีเอ็มซี 22 หรือ 24 คัน นอนซ้อนกันคันละ 4-5 ชั้น เดินทางจากสถานีตำรวจตากใบ ไปไกล 150 กิโลเมตร
ไปถึงค่ายอิงคยุทธบริหาร ใช้เวลากว่า 6 ชั่วโมง / ในอีกคดีที่ญาติฟ้องร้องรัฐ คดีจบลงที่การประนีประนอมยอมความ
กระทรวงกลาโหมจ่ายค่าสินไหมทดแทนรวม 42 ล้านบาทให้กับญาติผู้เสียหาย 79 ราย
ปิดหีบและนับคะแนนเสร็จแล้ว ที่หน่วยเลือกตั้งที่ 32 เขต 13 แขวงหัวหมาก เขตบางกะปิ กรุงเทพมหานคร
ผู้สมัคร ส.ส. และตัวแทนพรรคการเมืองจากหลายพรรคต่างมาเฝ้าสังเกตการนับคะแนนอย่างใกล้ชิด โดย
ฐิติภัสร์ โชติเดชาชัยนันต์ จากพรรคพลังประชารัฐ และพริษฐ์ วัชรสินธุ จากพรรคประชาธิปัตย์ได้คะแนน
96 คะแนนเท่ากัน
เช้าวันอาทิตย์ที่ 21 เมษายน 2019 ซึ่งเป็นวันอีสเตอร์ วันสำคัญของชาวคริสต์
เกิดเหตุระเบิดต่อเนื่องในโบสถ์คริสต์และโรงแรมอย่างน้อย 7 แห่งในประเทศศรีลังกา
มีผู้เสียชีวิตแล้วอย่างน้อย 156 คน และบาดเจ็บหลายร้อยคน ยังไม่มีข้อมูลว่าผู้ก่อเหตุมาจากฝ่ายใด
จีนกำหนดจัดการประชุมข้อริเริ่มสายแถบและเส้นทางในช่วงปลายสัปดาห์นี้ ปักกิ่งยืนยันว่า
อภิมหาโครงการเชื่อมโลกของจีนไม่ใช่เครื่องมือแผ่อิทธิพล แต่ยินดีรับฟังข้อวิจารณ์ เช่น ประเด็นกับดักหนี้สิน
และความไม่โปร่งใส รัฐบาลปักกิ่งบอกว่า เวทีประชุม Belt and Road Forum ในช่วงวันที่ 25-27 เมษายน
ถือเป็นงานการทูตที่สำคัญที่สุดของจีนในปี 2019
"""

[30]:

# Speed test: Calling "longest" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="longest")

CPU times: user 253 ms, sys: 2.27 ms, total: 256 ms
Wall time: 255 ms

[31]:

# Speed test: Calling "newmm" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="newmm")

CPU times: user 3.4 ms, sys: 60 µs, total: 3.46 ms
Wall time: 3.47 ms

[32]:

# Speed test: Calling "newmm" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="newmm-safe")

CPU times: user 4.08 ms, sys: 88 µs, total: 4.16 ms
Wall time: 4.15 ms

[33]:

#!pip install attacut
# Speed test: Calling "attacut" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="attacut")

CPU times: user 833 ms, sys: 174 ms, total: 1.01 s
Wall time: 576 ms

Get all possible segmentations

[34]:

from pythainlp.tokenize.multi_cut import find_all_segment, mmcut, segment

find_all_segment("มีความเป็นไปได้อย่างไรบ้าง")

[34]:

['มี|ความ|เป็น|ไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็นไปได้|อย่าง|ไร|บ้าง|',
 'มี|ความเป็นไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความเป็นไปได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็น|ไป|ได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็นไปได้|อย่างไร|บ้าง|',
 'มี|ความเป็นไป|ได้|อย่างไร|บ้าง|',
 'มี|ความเป็นไปได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็น|ไป|ได้|อย่างไรบ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่างไรบ้าง|',
 'มี|ความ|เป็นไปได้|อย่างไรบ้าง|',
 'มี|ความเป็นไป|ได้|อย่างไรบ้าง|',
 'มี|ความเป็นไปได้|อย่างไรบ้าง|']

Subword, syllable, and Thai Character Cluster (TCC)

Tokenization can also be done at subword level, either syllable or Thai Character Cluster (TCC).

Syllable segmentation is using `ssg <https://github.com/ponrawee/ssg>`__, a CRF syllable segmenter for Thai by Ponrawee Prasertsom.
TCC is smaller than syllable. For information about TCC, see Character Cluster Based Thai Information Retrieval (Theeramunkong et al. 2004).

Subword tokenization

Default subword tokenization engine is tcc, which will use Thai Character Cluster (TCC) as a subword unit.

[35]:

from pythainlp import subword_tokenize

subword_tokenize("ประเทศไทย")  # default subword unit is TCC

[35]:

['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']

Syllable tokenization

Default syllable tokenization engine is dict, which will use newmm word tokenization engine with a custom dictionary contains known syllables in Thai language.

[36]:

from pythainlp.tokenize import syllable_tokenize

text = "อับดุลเลาะ อีซอมูซอ สมองบวมรุนแรง"

syllable_tokenize(text)  # default engine is "dict"

[36]:

['อับ',
 'ดุล',
 'เลาะ',
 ' ',
 'อี',
 'ซอ',
 'มู',
 'ซอ',
 ' ',
 'สมอง',
 'บวม',
 'รุน',
 'แรง']

External `ssg <https://github.com/ponrawee/ssg>`__ engine call be called. Note that ssg engine ommitted whitespaces in the output tokens.

[37]:

syllable_tokenize(text, engine="ssg")  # use "ssg" for syllable

[37]:

['อับ', 'ดุล', 'เลาะ', ' อี', 'ซอ', 'มู', 'ซอ ', 'สมอง', 'บวม', 'รุน', 'แรง']

Low-level subword operations

These low-level TCC operations can be useful for some pre-processing tasks. Like checking if it’s ok to cut a string at a certain point or to find typos.

[38]:

from pythainlp.tokenize import tcc

tcc.segment("ประเทศไทย")

[38]:

['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']

[39]:

tcc.tcc_pos("ประเทศไทย")  # return positions

[39]:

{1, 3, 5, 6, 8, 9}

[40]:

for ch in tcc.tcc("ประเทศไทย"):  # TCC generator
    print(ch, end='-')

ป-ระ-เท-ศ-ไท-ย-

Transliteration

There are two types of transliteration here: romanization and transliteration.

Romanization will render Thai words in the Latin alphabet using the Royal Thai General System of Transcription (RTGS).
- Two engines are supported here: a simple royin engine (default) and a more accurate thai2rom engine.
Transliteration here, in PyThaiNLP context, means the sound representation of a string.
- Two engines are supported here: ipa (International Phonetic Alphabet system, using Epitran) (default) and icu (International Components for Unicode, using PyICU).

[41]:

from pythainlp.transliterate import romanize

romanize("แมว")  # output: 'maeo'

[41]:

'maeo'

[42]:

romanize("ภาพยนตร์")  # output: 'phapn' (*obviously wrong)

[42]:

'phapn'

[43]:

from pythainlp.transliterate import transliterate

transliterate("แมว")  # output: 'mɛːw'

Update Corpus...
Corpus: thai-g2p
- Already up to date.

[43]:

'm ɛː w ˧'

[44]:

transliterate("ภาพยนตร์")  # output: 'pʰaːpjanot'

[44]:

'pʰ aː p̚ ˥˩ . pʰ a ˦˥ . j o n ˧'

Normalization

normalize() removes zero-width spaces (ZWSP and ZWNJ), duplicated spaces, repeating vowels, and dangling characters. It also reorder vowels and tone marks during the process of removing repeating vowels.

[45]:

from pythainlp.util import normalize

normalize("เเปลก") == "แปลก"  # เ เ ป ล ก  vs แ ป ล ก

[45]:

True

The string below contains a non-standard order of Thai characters, Sara Aa (following vowel) + Mai Ek (upper tone mark). normalize() will reorder it to Mai Ek + Sara Aa.

[46]:

text = "เกา่"
normalize(text)

[46]:

'เก่า'

This can be useful for string matching, including tokenization.

[47]:

from pythainlp import word_tokenize

text = "เก็บวันน้ี พรุ่งน้ีก็เกา่"

print("tokenize immediately:")
print(word_tokenize(text))
print("\nnormalize, then tokenize:")
print(word_tokenize(normalize(text)))

tokenize immediately:
['เก็บ', 'วัน', 'น้ี', ' ', 'พรุ่งน้ี', 'ก็', 'เกา', '่']

normalize, then tokenize:
['เก็บ', 'วันนี้', ' ', 'พรุ่งนี้', 'ก็', 'เก่า']

The string below contains repeating vowels (multiple Sara A in a row) normalize() will keep only one of them. It can be use to reduce variations in spellings, useful for classification task.

[48]:

normalize("เกะะะ")

[48]:

'เกะ'

Internally, normalize() is just a series of function calls like this:

text = remove_zw(text)
text = remove_dup_spaces(text)
text = remove_repeat_vowels(text)
text = remove_dangling(text)

If you don’t like the behavior of default normalize(), you can call those functions shown above, also remove_tonemark() and reorder_vowels(), individually from pythainlp.util, to customize your own normalization.

Digit conversion

Thai text sometimes use Thai digits. This can reduce performance for classification and searching. PyThaiNP provides few utility functions to deal with this.

[49]:

from pythainlp.util import arabic_digit_to_thai_digit, thai_digit_to_arabic_digit, digit_to_text

text = "ฉุกเฉินที่ยุโรปเรียก 112 ๑๑๒"

arabic_digit_to_thai_digit(text)

[49]:

'ฉุกเฉินที่ยุโรปเรียก ๑๑๒ ๑๑๒'

[50]:

thai_digit_to_arabic_digit(text)

[50]:

'ฉุกเฉินที่ยุโรปเรียก 112 112'

[51]:

digit_to_text(text)

[51]:

'ฉุกเฉินที่ยุโรปเรียก หนึ่งหนึ่งสอง หนึ่งหนึ่งสอง'

Soundex

“Soundex is a phonetic algorithm for indexing names by sound.” (Wikipedia). PyThaiNLP provides three kinds of Thai soundex.

[52]:

from pythainlp.soundex import lk82, metasound, udom83

# check equivalence
print(lk82("รถ") == lk82("รด"))
print(udom83("วรร") == udom83("วัน"))
print(metasound("นพ") == metasound("นภ"))

True
True
True

[53]:

texts = ["บูรณะ", "บูรณการ", "มัก", "มัค", "มรรค", "ลัก", "รัก", "รักษ์", ""]
for text in texts:
    print(
        "{} - lk82: {} - udom83: {} - metasound: {}".format(
            text, lk82(text), udom83(text), metasound(text)
        )
    )

บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550
บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551
มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100
มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100
มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551
ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100
รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100
รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100
 - lk82:  - udom83:  - metasound:

Spellchecking

Default spellchecker uses Peter Norvig’s algorithm together with word frequency from Thai National Corpus (TNC).

spell() returns a list of all possible spellings.

[54]:

from pythainlp import spell

spell("เหลืยม")

[54]:

['เหลียม', 'เหลือม']

correct() returns the most likely spelling.

[55]:

from pythainlp import correct

correct("เหลืยม")

[55]:

'เหลียม'

Spellchecking - Custom dictionary and word frequency

Custom dictionary can be provided when creating spellchecker.

When create a NorvigSpellChecker object, you can pass a custom dictionary to custom_dict parameter.

custom_dict can be: - a dictionary (dict), with words (str) as keys and frequencies (int) as values; or - a list, a tuple, or a set of (word, frequency) tuples; or - a list, a tuple, or a set of just words, without their frequencies – in this case 1 will be assigned to every words.

[56]:

from pythainlp.spell import NorvigSpellChecker

user_dict = [("เหลียม", 50), ("เหลือม", 1000), ("เหลียว", 1000000)]
checker = NorvigSpellChecker(custom_dict=user_dict)

checker.spell("เหลืยม")

[56]:

['เหลือม', 'เหลียม']

As you can see, our version of NorvigSpellChecker gives the edit distance a priority over a word frequency.

You can use word frequencies from Thai National Corpus and Thai Textbook Corpus as well.

By default, NorvigSpellChecker uses Thai National Corpus.

[57]:

from pythainlp.corpus import ttc  # Thai Textbook Corpus

checker = NorvigSpellChecker(custom_dict=ttc.word_freqs())

checker.spell("เหลืยม")

[57]:

['เหลือม']

[58]:

checker.correct("เหลืยม")

[58]:

'เหลือม'

To check the current dictionary of a spellchecker:

[59]:

list(checker.dictionary())[1:10]

[59]:

[('พิธีเปิด', 18),
 ('ไส้กรอก', 40),
 ('ปลิง', 6),
 ('เต็ง', 13),
 ('ขอบคุณ', 356),
 ('ประสาน', 84),
 ('รำไร', 11),
 ('ร่วมท้อง', 4),
 ('ฝักมะขาม', 3)]

We can also apply conditions and filter function to dictionary when creating spellchecker.

[60]:

checker = NorvigSpellChecker()  # use default filter (remove any word with number or non-Thai character)
len(checker.dictionary())

[60]:

[61]:

checker = NorvigSpellChecker(min_freq=5, min_len=2, max_len=15)
len(checker.dictionary())

[61]:

[62]:

checker_no_filter = NorvigSpellChecker(dict_filter=None)  # use no filter
len(checker_no_filter.dictionary())

[62]:

[63]:

def remove_yamok(word):
    return False if "ๆ" in word else True

checker_custom_filter = NorvigSpellChecker(dict_filter=remove_yamok)  # use custom filter
len(checker_custom_filter.dictionary())

[63]:

Part-of-Speech Tagging

[64]:

from pythainlp.tag import pos_tag, pos_tag_sents

pos_tag(["การ","เดินทาง"])

[64]:

[('การ', 'FIXN'), ('เดินทาง', 'VACT')]

[65]:

sents = [["ประกาศสำนักนายกฯ", " ", "ให้",
    " ", "'พล.ท.สรรเสริญ แก้วกำเนิด'", " ", "พ้นจากตำแหน่ง",
    " ", "ผู้ทรงคุณวุฒิพิเศษ", "กองทัพบก", " ", "กระทรวงกลาโหม"],
    ["และ", "แต่งตั้ง", "ให้", "เป็น", "'อธิบดีกรมประชาสัมพันธ์'"]]

pos_tag_sents(sents)

[65]:

[[('ประกาศสำนักนายกฯ', 'NCMN'),
  (' ', 'PUNC'),
  ('ให้', 'JSBR'),
  (' ', 'PUNC'),
  ("'พล.ท.สรรเสริญ แก้วกำเนิด'", 'NCMN'),
  (' ', 'PUNC'),
  ('พ้นจากตำแหน่ง', 'NCMN'),
  (' ', 'PUNC'),
  ('ผู้ทรงคุณวุฒิพิเศษ', 'NCMN'),
  ('กองทัพบก', 'NCMN'),
  (' ', 'PUNC'),
  ('กระทรวงกลาโหม', 'NCMN')],
 [('และ', 'JCRG'),
  ('แต่งตั้ง', 'VACT'),
  ('ให้', 'JSBR'),
  ('เป็น', 'VSTA'),
  ("'อธิบดีกรมประชาสัมพันธ์'", 'NCMN')]]

Named-Entity Tagging

The tagger use BIO scheme: - B - beginning of entity - I - inside entity - O - outside entity

[66]:

#!pip3 install pythainlp[ner]
from pythainlp.tag.named_entity import ThaiNameTagger

ner = ThaiNameTagger()
ner.get_ner("24 มิ.ย. 2563 ทดสอบระบบเวลา 6:00 น. เดินทางจากขนส่งกรุงเทพใกล้ถนนกำแพงเพชร ไปจังหวัดกำแพงเพชร ตั๋วราคา 297 บาท")

[66]:

[('24', 'NUM', 'B-DATE'),
 (' ', 'PUNCT', 'I-DATE'),
 ('มิ.ย.', 'NOUN', 'I-DATE'),
 (' ', 'PUNCT', 'O'),
 ('2563', 'NUM', 'O'),
 (' ', 'PUNCT', 'O'),
 ('ทดสอบ', 'VERB', 'O'),
 ('ระบบ', 'NOUN', 'O'),
 ('เวลา', 'NOUN', 'O'),
 (' ', 'PUNCT', 'O'),
 ('6', 'NUM', 'B-TIME'),
 (':', 'PUNCT', 'I-TIME'),
 ('00', 'NUM', 'I-TIME'),
 (' ', 'PUNCT', 'I-TIME'),
 ('น.', 'NOUN', 'I-TIME'),
 (' ', 'PUNCT', 'O'),
 ('เดินทาง', 'VERB', 'O'),
 ('จาก', 'ADP', 'O'),
 ('ขนส่ง', 'NOUN', 'B-ORGANIZATION'),
 ('กรุงเทพ', 'NOUN', 'I-ORGANIZATION'),
 ('ใกล้', 'ADJ', 'O'),
 ('ถนน', 'NOUN', 'B-LOCATION'),
 ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),
 (' ', 'PUNCT', 'O'),
 ('ไป', 'AUX', 'O'),
 ('จังหวัด', 'VERB', 'B-LOCATION'),
 ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),
 (' ', 'PUNCT', 'O'),
 ('ตั๋ว', 'NOUN', 'O'),
 ('ราคา', 'NOUN', 'O'),
 (' ', 'PUNCT', 'O'),
 ('297', 'NUM', 'B-MONEY'),
 (' ', 'PUNCT', 'I-MONEY'),
 ('บาท', 'NOUN', 'I-MONEY')]

Word Vector

[67]:

import pythainlp.word_vector

pythainlp.word_vector.similarity("คน", "มนุษย์")

[67]:

0.2504981

[68]:

pythainlp.word_vector.doesnt_match(["คน", "มนุษย์", "บุคคล", "เจ้าหน้าที่", "ไก่"])

/usr/local/lib/python3.7/site-packages/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)

[68]:

'ไก่'

Number Spell Out

[69]:

from pythainlp.util import bahttext

bahttext(1234567890123.45)

[69]:

'หนึ่งล้านสองแสนสามหมื่นสี่พันห้าร้อยหกสิบเจ็ดล้านแปดแสนเก้าหมื่นหนึ่งร้อยยี่สิบสามบาทสี่สิบห้าสตางค์'

bahttext() will round the satang part

[70]:

bahttext(1.909)

[70]:

'หนึ่งบาทเก้าสิบเอ็ดสตางค์'