pythainlp.util

The pythainlp.util module contains utility functions for tasks such as text conversion and formatting.

Modules

pythainlp.util.arabic_digit_to_thai_digit(text: str) -> str
Parameters

text (str) – Text with Arabic digits such as ‘1’, ‘2’, ‘3’

Returns

Text with Arabic digits being converted to Thai digits such as ‘๑’, ‘๒’, ‘๓’
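
A digit-for-digit conversion like this can be sketched with a translation table. This is only an illustration using the standard library, not necessarily how pythainlp implements it; the helper name arabic_to_thai_digits is hypothetical.

```python
# Thai digits occupy U+0E50 (zero) through U+0E59 (nine), in order.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

# Translation table mapping each Arabic digit to the Thai digit at the same position.
_ARABIC_TO_THAI = str.maketrans("0123456789", THAI_DIGITS)


def arabic_to_thai_digits(text: str) -> str:
    """Hypothetical helper: replace Arabic digits, leave all other characters alone."""
    return text.translate(_ARABIC_TO_THAI)


converted = arabic_to_thai_digits("Room 123")
```

The reverse direction would only need the two arguments of str.maketrans swapped.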

pythainlp.util.bahttext(number: float) -> str

Converts a number to Thai text and appends the “Baht” currency suffix. Precision is fixed at two decimal places (0.00) to fit the “Satang” unit.

Similar to the BAHTTEXT function in Excel.
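
The two-decimal precision corresponds to splitting an amount into whole Baht and Satang (1 Baht = 100 Satang). The sketch below shows only that split, for non-negative amounts. The helper name split_baht_satang and the half-up rounding mode are assumptions made for illustration; the rounding that bahttext itself applies is not stated here.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple


def split_baht_satang(number: float) -> Tuple[int, int]:
    """Hypothetical helper: split a non-negative amount into (baht, satang)."""
    # Go through str() so that 1234.5 becomes Decimal("1234.5") exactly.
    # Assumption: round half up to two decimal places.
    amount = Decimal(str(number)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    baht = int(amount)                    # whole Baht
    satang = int((amount - baht) * 100)   # remaining hundredths
    return baht, satang


parts = split_baht_satang(1234.5)
```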

pythainlp.util.collate(data: Iterable, reverse: bool = False) -> List[str]
Parameters
  • data (list) – a list of strings to be sorted

  • reverse (bool) – reverse flag, set to get the result in descending order

Returns

a list of strings, sorted alphabetically, according to Thai rules

Example::
>>> from pythainlp.util import collate
>>> collate(['ไก่', 'เป็ด', 'หมู', 'วัว'])
['ไก่', 'เป็ด', 'วัว', 'หมู']
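
Thai dictionary order differs from plain code-point order mainly because a leading vowel (such as เ or ไ) is written before the consonant it belongs to, while words are ordered by that consonant first. The sketch below is a simplified sort key built on that idea. The name thai_sort_key is hypothetical, and this should not be read as the exact algorithm that collate uses.

```python
import re

# Tone marks and neighbouring signs, U+0E47 through U+0E4C.
_TONE = re.compile("[\u0e47-\u0e4c]")
# A leading vowel (U+0E40 to U+0E44) followed by a consonant (U+0E01 to U+0E2E).
_LEADING_VOWEL_CONSONANT = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str) -> str:
    """Hypothetical sort key: consonant before leading vowel, tone marks ignored first."""
    without_tone = _TONE.sub("", word)
    # Swap each leading vowel with the consonant that follows it.
    swapped = _LEADING_VOWEL_CONSONANT.sub(r"\2\1", without_tone)
    # Append the original word so that tone marks only break ties.
    return swapped + "\x00" + word


words = ["ไก่", "เป็ด", "หมู", "วัว"]
sorted_words = sorted(words, key=thai_sort_key)
```

With this key the four words come out in the same order as in the example above, whereas a plain sorted(words) would place the words starting with เ and ไ last.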

pythainlp.util.deletetone(text: str) -> str

Remove tone marks.

Parameters

text (str) – Thai text

Returns

Thai text with tone marks removed

pythainlp.util.digit_to_text(text: str) -> str
Parameters

text (str) – Text with digits such as ‘1’, ‘2’, ‘๓’, ‘๔’

Returns

Text with digits being spelled out in Thai

pythainlp.util.eng_to_thai(text: str) -> str

Corrects text that was typed with the wrong keyboard layout: Thai text typed while the keyboard was set to an English layout.

Parameters

text (str) – Incorrect input (Thai typed with an English keyboard layout)

Returns

Thai text

pythainlp.util.find_keyword(word_list: List[str], min_len: int = 3) -> Dict[str, int]
Parameters
  • word_list (list) – a list of words

  • min_len (int) – the minimum length of keywords to look for

Returns

dict

pythainlp.util.countthai(text: str, ignore_chars: str = ' \t\n\r\x0b\x0c0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') -> float
Parameters
  • text (str) – input text

  • ignore_chars (str) – characters to be ignored; the default covers whitespace, digits, and ASCII punctuation

Returns

float, the proportion of characters in the text that are Thai characters
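
A minimal sketch of such a measure, under two stated assumptions: a character counts as Thai when it falls in the Unicode Thai block (U+0E00 to U+0E7F), and the result is expressed as a fraction between 0.0 and 1.0. The scale used by countthai itself is not specified here, and the name thai_fraction is hypothetical.

```python
import string

# Whitespace, digits, and ASCII punctuation: the same characters as the
# default ignore_chars value shown in the signature.
DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def thai_fraction(text: str, ignore_chars: str = DEFAULT_IGNORE) -> float:
    """Hypothetical helper: fraction of non-ignored characters that are Thai."""
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        return 0.0  # nothing left to measure
    # Assumption: "Thai" means inside the Unicode Thai block.
    thai = sum(1 for ch in counted if "\u0e00" <= ch <= "\u0e7f")
    return thai / len(counted)


ratio = thai_fraction("\u0e44\u0e17\u0e22 abc 123")  # three Thai letters, "abc", ignored digits
```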

pythainlp.util.isthai(word: str, ignore_chars: str = '.') -> bool

Check if all characters in a word are Thai.

Parameters
  • word (str) – input text

  • ignore_chars (str) – characters to be ignored (i.e. will be considered as Thai)

Returns

True or False

pythainlp.util.isthaichar(ch: str) -> bool

Check if a character is Thai.

Parameters

ch (str) – input character

Returns

True or False
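
The two checks above can be approximated with a Unicode range test. The sketch assumes that "Thai" means membership in the Unicode Thai block (U+0E00 to U+0E7F); the function names are hypothetical and the library's own criteria may differ.

```python
def is_thai_char(ch: str) -> bool:
    """Hypothetical check: is this single character inside the Unicode Thai block?"""
    return 0x0E00 <= ord(ch) <= 0x0E7F


def is_thai_word(word: str, ignore_chars: str = ".") -> bool:
    """Hypothetical check: every character is Thai or listed in ignore_chars."""
    return all(ch in ignore_chars or is_thai_char(ch) for ch in word)


only_thai = is_thai_word("\u0e01.\u0e02.")  # Thai letters separated by periods
mixed = is_thai_word("\u0e01a")             # a Thai letter followed by a Latin letter
```

Note that in this sketch an empty string passes the word-level check, because all() over no characters is True.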

pythainlp.util.normalize(text: str) -> str

Normalize Thai text.

Parameters

text (str) – Thai text

Returns

normalized Thai text

Example::
>>> from pythainlp.util import normalize
>>> print(normalize("เเปลก") == "แปลก")  # เ เ ป ล ก versus แปลก
True
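
The example above relies on one specific repair: two consecutive เ characters (U+0E40) look almost identical to the single vowel แ (U+0E41) but are a different character sequence. A sketch of just that one rule follows; the helper name is hypothetical, and normalize may apply further rules that are not shown here.

```python
SARA_E = "\u0e40"   # the vowel เ
SARA_AE = "\u0e41"  # the vowel แ


def merge_double_sara_e(text: str) -> str:
    """Hypothetical helper: replace two consecutive SARA E with one SARA AE."""
    return text.replace(SARA_E + SARA_E, SARA_AE)


typed = SARA_E + SARA_E + "\u0e1b\u0e25\u0e01"  # five code points
fixed = merge_double_sara_e(typed)               # four code points
```
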

pythainlp.util.now_reign_year()
Returns

the current reign year for Rama X of the Chakri dynasty

pythainlp.util.num_to_thaiword(number: int) -> str
Parameters

number (int) – an integer indicating a quantity

Returns

a text that indicates the full amount in word form, properly ending each digit with the right term.

pythainlp.util.rank(words: List[str], exclude_stopwords: bool = False) -> collections.Counter

Sort words by frequency

Parameters
  • words (list) – a list of words

  • exclude_stopwords (bool) – exclude stopwords

Returns

Counter
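
Counting word frequencies into a Counter can be sketched with the standard library alone. In the sketch the stopword list is passed in explicitly, which is an assumption made to keep the example self-contained; rank takes a boolean flag instead, and the name rank_words is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def rank_words(words: List[str], stopwords: Iterable[str] = ()) -> Counter:
    """Hypothetical helper: count word frequencies, skipping the given stopwords."""
    excluded = set(stopwords)
    return Counter(word for word in words if word not in excluded)


counts = rank_words(["a", "b", "a", "the"], stopwords=["the"])
top = counts.most_common(1)  # most frequent word first
```

A Counter is a dict subclass, so the result supports both dictionary lookups and most_common() for frequency-ordered access.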

pythainlp.util.reign_year_to_ad(reign_year: int, reign: int) -> int

Convert a reign year of the Chakri dynasty, Thailand, to a year in AD.

pythainlp.util.text_to_arabic_digit(text: str) -> str
Parameters

text – A digit spelled out in Thai

Returns

An Arabic digit such as ‘1’, ‘2’, ‘3’

pythainlp.util.text_to_thai_digit(text: str) -> str
Parameters

text – A digit spelled out in Thai

Returns

A Thai digit such as ‘๑’, ‘๒’, ‘๓’

pythainlp.util.thai_strftime(datetime: datetime.datetime, fmt: str, thaidigit: bool = False) -> str

Thai date and time string formatter. Formatting directives are similar to those of datetime.strftime().

Thai names and the Thai Buddhist Era are used for these directives:
  • %a – abbreviated weekday name
  • %A – full weekday name
  • %b – abbreviated month name
  • %B – full month name
  • %y – year without century
  • %Y – year with century
  • %c – date and time representation
  • %v – short date representation (undocumented)

Other directives are passed to datetime.strftime().

Note 1: The Thai Buddhist Era (BE) year is simply converted from AD by adding 543. This is certainly not accurate for years before 1941 AD, due to the change in Thai New Year’s Day.

Note 2: This is meant to be an interim solution, since the locale module in Python’s standard library (which relies on C’s strftime()) does not support the “th” or “th_TH” locale yet. If it were supported, we could just call locale.setlocale(locale.LC_TIME, "th_TH") and then use the native datetime.strftime().

Note 3: We are trying to make this platform-independent and to support as many extensions as possible. See these links for strftime() extensions in POSIX, BSD, and GNU libc:
  • Python https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
  • C http://www.cplusplus.com/reference/ctime/strftime/
  • GNU https://metacpan.org/pod/POSIX::strftime::GNU
  • Linux https://linux.die.net/man/3/strftime
  • OpenBSD https://man.openbsd.org/strftime.3
  • FreeBSD https://www.unix.com/man-page/FreeBSD/3/strftime/
  • macOS https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/strftime.3.html
  • PHP https://secure.php.net/manual/en/function.strftime.php
  • JavaScript’s implementation https://github.com/samsonjs/strftime
  • strftime() quick reference http://www.strftime.net/

Returns

Date and time spelled out in text, with the month as its Thai name and the year in the Thai Buddhist Era. The year is simply converted from AD by adding 543 (this will not be accurate for years before 1941 AD, due to the change in Thai New Year’s Day).
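
The year conversion described in Note 1 is a fixed offset, sketched below with the standard library only. The helper name buddhist_era_year is hypothetical, and the sketch shares the stated limitation for years before 1941 AD.

```python
from datetime import datetime

BE_OFFSET = 543  # Buddhist Era year = AD year + 543, as stated in Note 1


def buddhist_era_year(dt: datetime) -> int:
    """Hypothetical helper: Thai Buddhist Era year for a datetime (simple offset)."""
    return dt.year + BE_OFFSET


dt = datetime(2019, 1, 31)
be_year = buddhist_era_year(dt)
# Directives without Thai-specific handling can still go through strftime().
day_month_be = "{} {}".format(dt.strftime("%d/%m"), be_year)
```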

pythainlp.util.thai_to_eng(text: str) -> str

Corrects text that was typed with the wrong keyboard layout: English text typed while the keyboard was set to a Thai layout.

Parameters

text (str) – Incorrect input (English typed with a Thai keyboard layout)

Returns

English text

pythainlp.util.thai_digit_to_arabic_digit(text: str) -> str
Parameters

text (str) – Text with Thai digits such as ‘๑’, ‘๒’, ‘๓’

Returns

Text with Thai digits being converted to Arabic digits such as ‘1’, ‘2’, ‘3’

pythainlp.util.thaiword_to_num(word: str) -> int

Converts a spelled-out Thai number word to its numeric value.

Parameters

word (str) – a spelled-out Thai number

Returns

number

pythainlp.util.thaicheck(word: str) -> bool

Check if a word is an “authentic Thai word”

Parameters

word (str) – word

Returns

True or False