pythainlp.util

The pythainlp.util module contains utility functions, such as text conversion and formatting.

Modules

pythainlp.util.arabic_digit_to_thai_digit(text: str) → str

This function converts Arabic digits (e.g. 1, 3, 10) to Thai digits (e.g. ๑, ๓, ๑๐).

Parameters

text (str) – Text with Arabic digits such as ‘1’, ‘2’, ‘3’

Returns

Text with Arabic digits being converted to Thai digits such as ‘๑’, ‘๒’, ‘๓’

Return type

str

Example

from pythainlp.util import arabic_digit_to_thai_digit

text = 'เป็นจำนวน 123,400.25 บาท'

arabic_digit_to_thai_digit(text)
# output: เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท
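Digit conversion of this kind is a one-to-one character mapping. The following is a minimal sketch of that idea, not PyThaiNLP's implementation; the helper names `to_thai_digits` and `to_arabic_digits` are hypothetical.

```python
# Hypothetical helpers illustrating a one-to-one digit mapping.
ARABIC_DIGITS = "0123456789"
# Thai digits occupy U+0E50 (๐) through U+0E59 (๙).
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

_TO_THAI = str.maketrans(ARABIC_DIGITS, THAI_DIGITS)
_TO_ARABIC = str.maketrans(THAI_DIGITS, ARABIC_DIGITS)


def to_thai_digits(text: str) -> str:
    """Replace every Arabic digit with its Thai counterpart."""
    return text.translate(_TO_THAI)


def to_arabic_digits(text: str) -> str:
    """Replace every Thai digit with its Arabic counterpart."""
    return text.translate(_TO_ARABIC)


print(to_thai_digits("123,400.25"))
print(to_arabic_digits(to_thai_digits("2019")))
```

Characters that are not digits (commas, full stops, Thai letters) pass through unchanged, which is why the separators in the example above survive the conversion.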
pythainlp.util.bahttext(number: float) → str

This function converts a number to Thai text and adds the suffix “บาท” (Baht). The precision is fixed at two decimal places (0.00) to fit the “สตางค์” (Satang) unit. This function works similarly to the BAHTTEXT function in MS Excel.

Parameters

number (float) – number to be converted into Thai Baht currency format

Returns

text representing the amount of money in the format of Thai currency

Return type

str

Example

from pythainlp.util import bahttext

bahttext(1)
# output: หนึ่งบาทถ้วน

bahttext(21)
# output: ยี่สิบเอ็ดบาทถ้วน

bahttext(200)
# output: สองร้อยบาทถ้วน
pythainlp.util.collate(data: Iterable, reverse: bool = False) → List[str]

This function sorts a list of strings in Thai alphabetical order.

Parameters
  • data (list[str]) – a list of words to be sorted

  • reverse (bool) – If reverse is set to True, the result is sorted in descending order. Otherwise, the result is sorted in ascending order. By default, reverse is set to False, sorting alphabetically in ascending order.

Returns

a list of strings sorted in Thai alphabetical order

Return type

list[str]

Example

from pythainlp.util import collate

collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'])
# output: ['กาล', 'เกิด', 'ไก่', 'เป็ด', 'วันที่', 'วัว', 'หมู']

collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'], \
    reverse=True)
# output: ['หมู', 'วัว', 'วันที่', 'เป็ด', 'ไก่', 'เกิด', 'กาล']
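Thai dictionary order differs from raw code point order mainly because leading vowels (เ แ โ ใ ไ) are written before the consonant they follow in sorting, and because tone marks carry little weight. The sketch below illustrates one simplified way to build such a sort key; it is an assumption for illustration, not necessarily how `collate` is implemented, and `thai_sort_key` and `sort_thai` are hypothetical names.

```python
import re
from typing import Iterable, List

# Marks ignored for the primary comparison: U+0E47 to U+0E4C
# (mai taikhu, the four tone marks, thanthakhat).
_MARKS = re.compile("[\u0e47-\u0e4c]")
# A leading vowel (U+0E40 to U+0E44) followed by a consonant (U+0E01 to U+0E2E).
_LEAD_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str) -> str:
    """Build a simplified key: drop marks, then put the consonant first."""
    key = _MARKS.sub("", word)
    return _LEAD_VOWEL.sub(r"\2\1", key)


def sort_thai(words: Iterable[str], reverse: bool = False) -> List[str]:
    """Sort by the simplified key, using the original word as a tie-breaker."""
    return sorted(words, key=lambda w: (thai_sort_key(w), w), reverse=reverse)


print(sort_thai(["ไก่", "เกิด", "กาล", "เป็ด", "หมู", "วัว", "วันที่"]))
```

With this key, "ไก่" is compared as if it started with ก, so it sorts alongside "กาล" and "เกิด" rather than after every consonant-initial word.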
pythainlp.util.delete_tone(text: str) → str

This function removes Thai tone marks from the text. There are 4 tone marks indicating 4 tones as follows:

  • Low tone (Thai: ไม้เอก _่ )

  • Falling tone (Thai: ไม้โท _้ )

  • High tone (Thai: ไม้ตรี _๊ )

  • Rising tone (Thai: ไม้จัตวา _๋ )

Parameters

text (str) – text in Thai language

Returns

text without Thai tone marks

Return type

str

Example

from pythainlp.util import delete_tone

delete_tone('สองพันหนึ่งร้อยสี่สิบเจ็ดล้านสี่แสนแปดหมื่นสามพันหกร้อยสี่สิบเจ็ด')
# output: สองพันหนึงรอยสีสิบเจ็ดลานสีแสนแปดหมืนสามพันหกรอยสีสิบเจ็ด
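Because the four tone marks are separate combining characters (U+0E48 to U+0E4B), removing them amounts to deleting those code points. A minimal sketch, assuming only these four characters are targeted; `strip_tone_marks` is a hypothetical name, not part of the library:

```python
# The four Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# A translation table that deletes every tone mark.
_DELETE_TONES = str.maketrans("", "", TONE_MARKS)


def strip_tone_marks(text: str) -> str:
    """Return the text with the four Thai tone marks removed."""
    return text.translate(_DELETE_TONES)


# "น้ำ" (water) loses its mai tho and becomes "นำ".
print(strip_tone_marks("\u0e19\u0e49\u0e33"))
```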
pythainlp.util.digit_to_text(text: str) → str
Parameters

text (str) – Text with digits such as ‘1’, ‘2’, ‘๓’, ‘๔’

Returns

Text with digits being spelled out in Thai

pythainlp.util.eng_to_thai(text: str) → str

Corrects text that was mistakenly typed with the English-US QWERTY keyboard layout by converting it to the intended Thai Kedmanee keyboard layout.

Parameters

text (str) – incorrect text input (type Thai with English keyboard)

Returns

Thai text where incorrect typing with a keyboard layout is corrected

Return type

str

Example

Intended to type “ธนาคารแห่งประเทศไทย”, but got “Tok8kicsj'xitgmLwmp”:

from pythainlp.util import eng_to_thai

eng_to_thai("Tok8kicsj'xitgmLwmp")
# output: ธนาคารแห่งประเทศไทย
pythainlp.util.find_keyword(word_list: List[str], min_len: int = 3) → Dict[str, int]

This function counts the frequency of words in the list, excluding stopwords, and returns the result as a frequency dictionary.

Parameters
  • word_list (list) – a list of words

  • min_len (int) – the minimum frequency a word must have to be included in the result

Returns

a dictionary object with key-value pair as word and its raw count

Return type

dict[str, int]

Example

from pythainlp.util import find_keyword

words = ["บันทึก", "เหตุการณ์", "บันทึก", "เหตุการณ์",
         " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร",
         "และ", "การ", "บันทึก", "เสียง", "ใน", "เหตุการณ์"]

find_keyword(words)
# output: {'บันทึก': 4, 'เหตุการณ์': 3}

find_keyword(words, min_len=1)
# output: {' ': 2, 'บันทึก': 4, 'ลายลักษณ์อักษร': 1,
#  'เสียง': 1, 'เหตุการณ์': 3}
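The behaviour described above (drop stopwords, count, keep words at or above a frequency threshold) can be sketched with `collections.Counter`. This is an illustration under assumptions, not the library's code: `keyword_counts` is a hypothetical name, and `demo_stopwords` is a tiny stand-in for the library's own stopword list.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def keyword_counts(
    words: Iterable[str], stopwords: Set[str], min_freq: int = 3
) -> Dict[str, int]:
    """Count non-stopword tokens and keep those seen at least min_freq times."""
    counts = Counter(w for w in words if w not in stopwords)
    return {w: c for w, c in counts.items() if c >= min_freq}


# Hypothetical stopword set, for illustration only.
demo_stopwords = {"มี", "การ", "เป็น", "ใน", "และ"}
demo_words = ["บันทึก", "เหตุการณ์", "บันทึก", "การ", "บันทึก", "เหตุการณ์", "ใน"]

print(keyword_counts(demo_words, demo_stopwords, min_freq=2))
```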
pythainlp.util.countthai(text: str, ignore_chars: str = ' \t\n\r\x0b\x0c0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → float

This function calculates the percentage of Thai characters in the text, with an option to ignore some characters.

Parameters
  • text (str) – input text

  • ignore_chars (str) – string of characters to ignore from counting. By default, the ignored characters are whitespace, newline, digits, and punctuation.

Returns

percentage of Thai characters in the text

Return type

float

Example

Find the percentage of Thai characters in the text with the default set of ignored characters (whitespace, newline characters, punctuation, and digits):

from pythainlp.util import countthai

countthai("ดอนัลด์ จอห์น ทรัมป์ English: Donald John Trump")
# output: 45.0

countthai("(English: Donald John Trump)")
# output: 0.0

Find the percentage of Thai characters in the text while ignoring only punctuation but not whitespace, newline character and digits:

import string

string.punctuation
# output: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

countthai("ดอนัลด์ จอห์น ทรัมป์ English: Donald John Trump", \
    ignore_chars=string.punctuation)
# output: 39.130434782608695

countthai("ดอนัลด์ จอห์น ทรัมป์ (English: Donald John Trump)", \
    ignore_chars=string.punctuation)
# output: 39.130434782608695
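The percentages in these examples follow from a simple ratio: Thai characters divided by all characters that are not ignored. The sketch below reproduces that arithmetic under the assumption that "Thai character" means any code point in the Unicode Thai block (U+0E00 to U+0E7F); `thai_percentage` and `is_thai_char` are hypothetical names.

```python
import string

# Same categories as the documented default: whitespace, digits, punctuation.
DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def is_thai_char(ch: str) -> bool:
    """Assumption: a character is Thai if it lies in the Unicode Thai block."""
    return "\u0e00" <= ch <= "\u0e7f"


def thai_percentage(text: str, ignore_chars: str = DEFAULT_IGNORE) -> float:
    """Percentage of Thai characters among the characters that are not ignored."""
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        return 0.0
    thai = sum(1 for ch in counted if is_thai_char(ch))
    return 100.0 * thai / len(counted)


sample = "ดอนัลด์ จอห์น ทรัมป์ English: Donald John Trump"
print(thai_percentage(sample))
print(thai_percentage(sample, ignore_chars=string.punctuation))
```

The sample contains 18 Thai characters. With the default ignore set, 40 characters are counted (18/40); when only punctuation is ignored, the six spaces are counted too (18/46), which lowers the percentage.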
pythainlp.util.is_native_thai(word: str) → bool

Check if a word is a “native Thai word” (Thai: “คำไทยแท้”). This function is based on a simple heuristic algorithm and is not entirely reliable.

Parameters

word (str) – word

Returns

True if the word is judged to be a native Thai word, otherwise False

Return type

bool

Example

English word:

from pythainlp.util import is_native_thai

is_native_thai("Avocado")
# output: False

Native Thai word:

is_native_thai("มะม่วง")
# output: True
is_native_thai("ตะวัน")
# output: True

Non-native Thai word:

is_native_thai("สามารถ")
# output: False
is_native_thai("อิสริยาภรณ์")
# output: False
pythainlp.util.isthai(word: str, ignore_chars: str = '.') → bool

This function checks if all characters in the input string are Thai characters.

Parameters
  • word (str) – input text

  • ignore_chars (str) – string characters to be ignored (i.e. will be considered as Thai)

Returns

returns True if the input text contains only Thai characters, otherwise returns False

Return type

bool

Example

Check if all characters are Thai characters. By default, only the full stop (“.”) is ignored:

from pythainlp.util import isthai

isthai("กาลเวลา")
# output: True

isthai("กาลเวลา.")
# output: True

Explicitly ignore digits, whitespace, and the following characters (“-”, “.”, “$”, “,”):

from pythainlp.util import isthai

isthai("กาลเวลา, การเวลา-ก,  3.75$", ignore_chars="1234567890.-,$ ")
# output: True
pythainlp.util.isthaichar(ch: str) → bool

This function checks if the input character is a Thai character.

Parameters

ch (str) – input character

Returns

returns True if the input character is a Thai character, otherwise returns False

Return type

bool

Example

from pythainlp.util import isthaichar

isthaichar("ก") # THAI CHARACTER KO KAI
# output: True

isthaichar("๐") # THAI DIGIT ZERO
# output: True

isthaichar("๕") # THAI DIGIT FIVE
# output: True
pythainlp.util.normalize(text: str) → str

This function normalizes Thai text according to the following rules:

  • Remove redundant tone marks and vowel symbols.

  • Substitute a doubled “เ” (“เเ”) with “แ”.

Parameters

text (str) – Thai text to be normalized

Returns

normalized Thai text according to the rules

Return type

str

Example

from pythainlp.util import normalize

normalize('สระะน้ำ')
# output: สระน้ำ

normalize('เเปลก')
# output: แปลก

normalize('นานาาา')
# output: นานา
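The two rules can be approximated with one string replacement and one regular expression. This is a simplified sketch under assumptions: it treats vowel signs in U+0E30 to U+0E3A and marks in U+0E47 to U+0E4E as characters that should not repeat consecutively, which may not match the library's full rule set. `normalize_sketch` is a hypothetical name.

```python
import re

# Assumption: these vowel signs and marks should not appear twice in a row.
_NON_REPEATING = "\u0e30-\u0e3a\u0e47-\u0e4e"
_DUPLICATES = re.compile(f"([{_NON_REPEATING}])\\1+")


def normalize_sketch(text: str) -> str:
    """Apply two simplified rules: merge a doubled sara e, collapse repeats."""
    # Two consecutive sara e (U+0E40) are a common mistyping of sara ae (U+0E41).
    text = text.replace("\u0e40\u0e40", "\u0e41")
    # Collapse runs of the same vowel sign or mark into a single occurrence.
    return _DUPLICATES.sub(r"\1", text)


# A doubled sara a (U+0E30) after "สร" collapses to a single one.
print(normalize_sketch("\u0e2a\u0e23\u0e30\u0e30"))
```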
pythainlp.util.now_reign_year() → int

This function returns the current reign year of the 10th King of the Chakri dynasty.

Returns

reign year of the 10th King of Chakri dynasty.

Return type

int

Example

from pythainlp.util import now_reign_year

text = "เป็นปีที่ {reign_year} ในรัชกาลปัจจุบัน"\
    .format(reign_year=now_reign_year())

print(text)
# output (when run in 2019): เป็นปีที่ 4 ในรัชกาลปัจจุบัน
pythainlp.util.num_to_thaiword(number: int) → str

This function converts a number to Thai text.

Parameters

number (int) – an integer number to be converted to Thai text

Returns

text representing the number in Thai

Return type

str

Example

from pythainlp.util import num_to_thaiword

num_to_thaiword(1)
# output: หนึ่ง

num_to_thaiword(11)
# output: สิบเอ็ด
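Spelling out a Thai number combines digit names with place names, with a few irregular forms: a 1 in the tens place is read “สิบ”, a 2 in the tens place is read “ยี่สิบ”, and a final 1 in a multi-digit number is read “เอ็ด”. The sketch below covers only 0 to 999,999 and is an illustration of these rules, not the library's implementation; `spell_thai_number` is a hypothetical name.

```python
DIGIT_NAMES = ["ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]
# Place names for units, tens, hundreds, thousands, ten-thousands, hundred-thousands.
PLACE_NAMES = ["", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน"]


def spell_thai_number(number: int) -> str:
    """Spell out an integer from 0 to 999,999 in Thai words (simplified)."""
    if not 0 <= number < 1_000_000:
        raise ValueError("this sketch only handles 0 to 999,999")
    if number == 0:
        return DIGIT_NAMES[0]
    digits = [int(d) for d in str(number)]
    parts = []
    for index, digit in enumerate(digits):
        place = len(digits) - 1 - index
        if digit == 0:
            continue  # zeros are silent inside a number
        if place == 1 and digit == 1:
            parts.append("สิบ")  # 10 is "sip", not "nueng sip"
        elif place == 1 and digit == 2:
            parts.append("ยี่สิบ")  # 20 uses the irregular "yi"
        elif place == 0 and digit == 1 and len(digits) > 1:
            parts.append("เอ็ด")  # trailing 1 in a multi-digit number
        else:
            parts.append(DIGIT_NAMES[digit] + PLACE_NAMES[place])
    return "".join(parts)


print(spell_thai_number(21))
```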
pythainlp.util.rank(words: List[str], exclude_stopwords: bool = False) → collections.Counter

Count word frequency given a list of Thai words, with an option to exclude stopwords.

Parameters
  • words (list) – a list of words

  • exclude_stopwords (bool) – If this parameter is set to True, stopwords are excluded from counting. Otherwise, stopwords are counted. By default, exclude_stopwords is set to False.

Returns

a Counter object representing word frequency from the text

Return type

collections.Counter

Example

Include stopwords in counting word frequency:

from pythainlp.util import rank

words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", \
"เป็น", " ", "ลายลักษณ์อักษร"]

rank(words)
# output:
# Counter(
#     {
#         ' ': 2,
#         'การ': 1,
#         'บันทึก': 2,
#         'มี': 1,
#         'ลายลักษณ์อักษร': 1,
#         'เป็น': 1,
#         'เหตุการณ์': 1
#     })

Exclude stopwords in counting word frequency:

from pythainlp.util import rank

words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", \
    "เป็น", " ", "ลายลักษณ์อักษร"]

rank(words, exclude_stopwords=True)
# output:
# Counter(
#     {
#         ' ': 2,
#         'บันทึก': 2,
#         'ลายลักษณ์อักษร': 1,
#         'เหตุการณ์': 1
#     })
pythainlp.util.reign_year_to_ad(reign_year: int, reign: int) → int

This function calculates the AD year from the reign year for the 7th to 10th Kings of the Chakri dynasty, Thailand. For instance, the AD year of the 4th reign year of the 10th King is 2019.

Parameters
  • reign_year (int) – reign year of the King

  • reign (int) – the reign of the King (7, 8, 9, or 10)

Returns

the AD year corresponding to the given reign and reign year

Return type

int

Example

from pythainlp.util import reign_year_to_ad

print("The 4th reign year of the King Rama X is in", \
    reign_year_to_ad(4, 10))
# output: The 4th reign year of the King Rama X is in 2019

print("The 1st reign year of the King Rama IX is in", \
    reign_year_to_ad(1, 9))
# output: The 1st reign year of the King Rama IX is in 1946
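The conversion is an offset from the first year of each reign. The sketch below uses only the two start years that can be inferred from the examples above (reign year 1 of Rama IX is 1946; reign year 4 of Rama X is 2019, so reign year 1 is 2016). Reigns 7 and 8 are left out because their start years are not stated here. `reign_year_to_ad_sketch` is a hypothetical name.

```python
# First AD year of each reign, inferred from the documented examples only.
REIGN_START_AD = {9: 1946, 10: 2016}


def reign_year_to_ad_sketch(reign_year: int, reign: int) -> int:
    """Convert a 1-based reign year to an AD year for reigns 9 and 10."""
    if reign not in REIGN_START_AD:
        raise ValueError("this sketch only knows reigns 9 and 10")
    if reign_year < 1:
        raise ValueError("reign years start at 1")
    # Reign year 1 is the start year itself, hence the minus one.
    return REIGN_START_AD[reign] + reign_year - 1


print(reign_year_to_ad_sketch(4, 10))
print(reign_year_to_ad_sketch(1, 9))
```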
pythainlp.util.thai_time(time_data: Union[datetime.time, datetime.datetime, str], fmt: str = '24h', precision: Optional[str] = None) → str

Spell out time to Thai words.

Parameters
  • time_data (datetime.time, datetime.datetime, or str) – time input; can be a datetime.time object, a datetime.datetime object, or a string (in H:M or H:M:S format, using the 24-hour clock)

  • fmt (str) – time output format:
      • 24h – 24-hour clock (default)
      • 6h – 6-hour clock
      • m6h – modified 6-hour clock

  • precision (str or None) – precision of the spelled-out time:
      • m – always spell out to the minute level
      • s – always spell out to the second level
      • None – spell out only the non-zero parts

Returns

Time spelled out in Thai words

Return type

str

Example

import datetime
from pythainlp.util import thai_time

thai_time("8:17")
# output: แปดนาฬิกาสิบเจ็ดนาที

thai_time("8:17", "6h")
# output: สองโมงเช้าสิบเจ็ดนาที

thai_time("8:17", "m6h")
# output: แปดโมงสิบเจ็ดนาที

thai_time("18:30", fmt="m6h")
# output: หกโมงครึ่ง

thai_time(datetime.time(12, 3, 0))
# output: สิบสองนาฬิกาสามนาที

thai_time(datetime.time(12, 3, 0), precision="s")
# output: สิบสองนาฬิกาสามนาทีศูนย์วินาที

pythainlp.util.text_to_arabic_digit(text: str) → str

This function converts a digit spelled out in Thai to an Arabic digit.

Parameters

text (str) – A digit spelled out in Thai

Returns

An Arabic digit such as ‘1’, ‘2’, ‘3’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type

str

Example

from pythainlp.util import text_to_arabic_digit

text_to_arabic_digit("ศูนย์")
# output: 0
text_to_arabic_digit("หนึ่ง")
# output: 1
text_to_arabic_digit("แปด")
# output: 8
text_to_arabic_digit("เก้า")
# output: 9

# For text that is not Thai digit spelled out
text_to_arabic_digit("สิบ") == ""
# output: True
text_to_arabic_digit("เก้าร้อย") == ""
# output: True
pythainlp.util.text_to_thai_digit(text: str) → str

This function converts a digit spelled out in Thai to a Thai digit.

Parameters

text (str) – A digit spelled out in Thai

Returns

A Thai digit such as ‘๑’, ‘๒’, ‘๓’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type

str

Example

from pythainlp.util import text_to_thai_digit

text_to_thai_digit("ศูนย์")
# output: ๐
text_to_thai_digit("หนึ่ง")
# output: ๑
text_to_thai_digit("แปด")
# output: ๘
text_to_thai_digit("เก้า")
# output: ๙

# For text that is not Thai digit spelled out
text_to_thai_digit("สิบ") == ""
# output: True
text_to_thai_digit("เก้าร้อย") == ""
# output: True
pythainlp.util.thai_strftime(datetime: datetime.datetime, fmt: str, thaidigit: bool = False) → str

This function converts a datetime.datetime object into Thai date and time format. The formatting directives are similar to those of datetime.strftime().

This function uses Thai names and Thai Buddhist Era for these directives:
  • %a - abbreviated weekday name (i.e. “จ”, “อ”, “พ”, “พฤ”, “ศ”, “ส”, “อา”)

  • %A - full weekday name (e.g. “วันจันทร์”, “วันอังคาร”, “วันเสาร์”, “วันอาทิตย์”)

  • %b - abbreviated month name (e.g. “ม.ค.”, “ก.พ.”, “มี.ค.”, “เม.ย.”, “พ.ค.”, “มิ.ย.”, “ธ.ค.”)

  • %B - full month name (e.g. “มกราคม”, “กุมภาพันธ์”, “พฤศจิกายน”, “ธันวาคม”)

  • %y - year without century (e.g. “56”, “10”)

  • %Y - year with century (e.g. “2556”, “2410”)

  • %c - date and time representation (e.g. “พ 6 ต.ค. 01:40:00 2519”)

  • %v - short date representation (e.g. “ 6-ม.ค.-2562”, “27-ก.พ.-2555”)

Other directives will be passed to datetime.strftime()

Note
  • The Thai Buddhist Era (BE) year is simply converted from AD by adding 543. This is not accurate for years before 1941 AD, due to the change in Thai New Year’s Day.

  • This is meant to be an interim solution, since the locale module in Python’s standard library (which relies on C’s strftime()) does not support the “th” or “th_TH” locale yet. If it were supported, we could just call locale.setlocale(locale.LC_TIME, “th_TH”) and then use the native datetime.strftime().

We are trying to make this platform-independent and to support as many strftime() extensions as possible, including those found in POSIX, BSD, and GNU libc.

Parameters
  • datetime (datetime.datetime) – an instantiated object of datetime.datetime

  • fmt (str) – string containing date and time directives

  • thaidigit (bool) – If thaidigit is set to False (default), numbers are represented in Arabic digits. If it is set to True, they are represented in Thai digits.

Returns

Date and time text, with the month in Thai and the year in the Thai Buddhist Era. The year is simply converted from AD by adding 543 (this is not accurate for years before 1941 AD, due to the change in Thai New Year’s Day).

Return type

str

Example

import datetime
from pythainlp.util import thai_strftime

datetime_object = datetime.datetime(year=2019, month=6, day=10, \
    hour=15, minute=59, second=0, microsecond=0)

print(datetime_object)
# output: 2019-06-10 15:59:00

print(thai_strftime(datetime_object, "%A %d %B %Y "))
# output: วันจันทร์ 10 มิถุนายน 2562

print(thai_strftime(datetime_object, "%a %d %b %y "))
# output: จ 10 มิ.ย. 62

print(thai_strftime(datetime_object, "%D (%v)"))
# output: 06/10/62 (10-มิ.ย.-2562)

print(thai_strftime(datetime_object, "%D (%c)"))
# output: 06/10/62 (จ  10 มิ.ย. 15:59:00 2562)
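The core of the Thai formatting is a lookup of Thai weekday and month names plus the fixed 543-year offset for the Buddhist Era. The following sketch illustrates only that part, for a single fixed layout, and does not parse format directives; `format_thai_date` is a hypothetical name, and the same pre-1941 caveat about the offset applies.

```python
import datetime

# Ordered Monday first, to match datetime.weekday().
THAI_WEEKDAYS = [
    "วันจันทร์", "วันอังคาร", "วันพุธ", "วันพฤหัสบดี",
    "วันศุกร์", "วันเสาร์", "วันอาทิตย์",
]
THAI_MONTHS = [
    "มกราคม", "กุมภาพันธ์", "มีนาคม", "เมษายน", "พฤษภาคม", "มิถุนายน",
    "กรกฎาคม", "สิงหาคม", "กันยายน", "ตุลาคม", "พฤศจิกายน", "ธันวาคม",
]
BE_OFFSET = 543  # Buddhist Era year = AD year + 543


def format_thai_date(dt: datetime.datetime) -> str:
    """Format as '<weekday> <day> <month> <BE year>' using Thai names."""
    weekday = THAI_WEEKDAYS[dt.weekday()]
    month = THAI_MONTHS[dt.month - 1]
    return f"{weekday} {dt.day} {month} {dt.year + BE_OFFSET}"


print(format_thai_date(datetime.datetime(2019, 6, 10, 15, 59)))
```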
pythainlp.util.thai_to_eng(text: str) → str

Corrects text that was mistakenly typed with the Thai Kedmanee keyboard layout by converting it to the intended English-US QWERTY keyboard layout.

Parameters

text (str) – incorrect text input (type English with Thai keyboard)

Returns

English text where incorrect typing with a keyboard layout is corrected

Return type

str

Example

Intended to type “Bank of Thailand”, but got “ฺฟืา นด ธ้ฟรสฟืก”:

from pythainlp.util import thai_to_eng

thai_to_eng("ฺฟืา นด ธ้ฟรสฟืก")
# output: 'Bank of Thailand'
pythainlp.util.thai_digit_to_arabic_digit(text: str) → str

This function converts Thai digits (e.g. ๑, ๓, ๑๐) to Arabic digits (e.g. 1, 3, 10).

Parameters

text (str) – Text with Thai digits such as ‘๑’, ‘๒’, ‘๓’

Returns

Text with Thai digits being converted to Arabic digits such as ‘1’, ‘2’, ‘3’

Return type

str

Example

from pythainlp.util import thai_digit_to_arabic_digit

text = 'เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท'

thai_digit_to_arabic_digit(text)
# output: เป็นจำนวน 123,400.25 บาท
pythainlp.util.thaiword_to_num(word: str) → int

Converts a number spelled out in Thai script into an integer.

Parameters

word (str) – Number spelled out in Thai script

Returns

Corresponding integer value of the input

Return type

int

Example

from pythainlp.util import thaiword_to_num

thaiword_to_num("ศูนย์")
# output: 0

thaiword_to_num("สองล้านสามแสนหกร้อยสิบสอง")
# output: 2300612