pythainlp.util

The pythainlp.util module contains utility functions, such as text conversion and formatting.

Modules

pythainlp.util.abbreviation_to_full_text(text: str, top_k: int = 2) List[Tuple[str, float | None]][source]

This function converts Thai text that contains abbreviations to full text.

It uses KhamYo to handle abbreviations. See KhamYo for more information.

Parameters:
  • text (str) – Thai text

  • top_k (int) – number of top-ranked expansions to return

Returns:

Thai full text with the abbreviations expanded, each paired with a cosine similarity score (original text vs. modified text).

Return type:

List[Tuple[str, Union[float, None]]]

Example:

from pythainlp.util import abbreviation_to_full_text

text = "รร.ของเราน่าอยู่"

abbreviation_to_full_text(text)
# output: [
# ('โรงเรียนของเราน่าอยู่', tensor(0.3734)), 
# ('โรงแรมของเราน่าอยู่', tensor(0.2438))
# ]
pythainlp.util.arabic_digit_to_thai_digit(text: str) str[source]

This function converts Arabic digits (e.g. 1, 3, 10) to Thai digits (e.g. ๑, ๓, ๑๐).

Parameters:

text (str) – Text with Arabic digits such as ‘1’, ‘2’, ‘3’

Returns:

Text with Arabic digits being converted to Thai digits such as ‘๑’, ‘๒’, ‘๓’

Return type:

str

Example:

from pythainlp.util import arabic_digit_to_thai_digit

text = 'เป็นจำนวน 123,400.25 บาท'

arabic_digit_to_thai_digit(text)
# output: เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท
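
The digit mapping described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not PyThaiNLP's actual implementation; the helper name to_thai_digits is hypothetical.

```python
# Hypothetical helper, not part of PyThaiNLP: map each Arabic digit to the
# Thai digit with the same value and leave every other character untouched.
_ARABIC_TO_THAI = str.maketrans("0123456789", "๐๑๒๓๔๕๖๗๘๙")


def to_thai_digits(text: str) -> str:
    return text.translate(_ARABIC_TO_THAI)


print(to_thai_digits("เป็นจำนวน 123,400.25 บาท"))
```

A translation table keeps separators such as "," and "." in place, which matches the behaviour shown in the example above.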
pythainlp.util.bahttext(number: float) str[source]

This function converts a number to Thai text and adds the suffix “บาท” (Baht). The precision is fixed at two decimal places (0.00) to fit the “สตางค์” (Satang) unit. This function works similarly to the BAHTTEXT function in Microsoft Excel.

Parameters:

number (float) – number to be converted into Thai Baht currency format

Returns:

text representing the amount of money in the format of Thai currency

Return type:

str

Example:

from pythainlp.util import bahttext

bahttext(1)
# output: หนึ่งบาทถ้วน

bahttext(21)
# output: ยี่สิบเอ็ดบาทถ้วน

bahttext(200)
# output: สองร้อยบาทถ้วน
pythainlp.util.convert_years(year: str, src='be', target='ad') str[source]

Convert a year from one calendar era to another.

Parameters:
  • year (str) – the year to convert

  • src (str) – the era of the input year

  • target (str) – the era to convert the year to

Returns:

The converted year

Return type:

str

Options for src and target
  • be - Buddhist calendar

  • ad - Anno Domini

  • re - Rattanakosin era

  • ah - Anno Hejira

Warning: This function works properly only for years after 1941, because Thailand changed its calendar in 1941. If you are a time traveler or a historian, take care to use the correct calendar.
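
As a rough illustration of the be and ad options, the sketch below converts between the two using a fixed offset of 543 years. It is an assumption-laden sketch, not the library's code: the helper names are hypothetical, only BE and AD are covered, and the fixed offset is only meaningful for years after 1941.

```python
# Assumption: BE and AD differ by a constant 543 years (valid after 1941).
BE_AD_OFFSET = 543


def be_to_ad(year: str) -> str:
    """Hypothetical helper: Buddhist Era year string to AD year string."""
    return str(int(year) - BE_AD_OFFSET)


def ad_to_be(year: str) -> str:
    """Hypothetical helper: AD year string to Buddhist Era year string."""
    return str(int(year) + BE_AD_OFFSET)


print(be_to_ad("2566"))
print(ad_to_be("2023"))
```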

pythainlp.util.collate(data: Iterable, reverse: bool = False) List[str][source]

This function sorts strings (almost) according to Thai dictionary.

Important notes: this implementation ignores tone marks and symbols

Parameters:
  • data (Iterable) – a list of words to be sorted

  • reverse (bool, optional) – If reverse is set to True the result will be sorted in descending order. Otherwise, the result will be sorted in ascending order, defaults to False

Returns:

a list of strings, sorted alphabetically, (almost) according to Thai dictionary

Return type:

List[str]

Example:

from pythainlp.util import collate

collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'])
# output: ['กาล', 'เกิด', 'ไก่', 'เป็ด', 'วันที่', 'วัว', 'หมู']

collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'], \
    reverse=True)
# output: ['หมู', 'วัว', 'วันที่', 'เป็ด', 'ไก่', 'เกิด', 'กาล']
pythainlp.util.count_thai_chars(text: str) dict[source]

Count Thai characters by type

This function returns the number of Thai characters of each type (consonants, vowels, lead_vowels, follow_vowels, above_vowels, below_vowels, tonemarks, signs, thai_digits, punctuations, non_thai).

Parameters:

text (str) – Text

Returns:

Dict with numbers of Thai characters by type

Return type:

dict

Example:

from pythainlp.util import count_thai_chars

count_thai_chars("ทดสอบภาษาไทย")
# output: {
# 'vowels': 3,
# 'lead_vowels': 1,
# 'follow_vowels': 2,
# 'above_vowels': 0,
# 'below_vowels': 0,
# 'consonants': 9,
# 'tonemarks': 0,
# 'signs': 0,
# 'thai_digits': 0,
# 'punctuations': 0,
# 'non_thai': 0
# }
pythainlp.util.countthai(text: str, ignore_chars: str = ' \t\n\r\x0b\x0c0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') float[source]

Find proportion of Thai characters in a given text

Parameters:
  • text (str) – input text

  • ignore_chars (str, optional) – characters to be ignored, defaults to whitespace, digits, and punctuation.

Returns:

proportion of Thai characters in the text (percent)

Return type:

float

Example:

from pythainlp.util import countthai

countthai("ไทยเอ็นแอลพี 3.0")
# output: 100.0

countthai("PyThaiNLP 3.0")
# output: 0.0

countthai("ใช้งาน PyThaiNLP 3.0")
# output: 40.0

countthai("ใช้งาน PyThaiNLP 3.0", ignore_chars="")
# output: 30.0
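
The percentages above follow from a simple ratio: Thai characters divided by all characters that are not ignored. The sketch below reproduces that arithmetic under two assumptions that are not taken from the library source: a character counts as Thai when it falls in the Unicode Thai block (U+0E00 to U+0E7F), and the helper name thai_percentage is hypothetical.

```python
import string

# Same default as the signature above: whitespace, digits, and punctuation.
_DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def thai_percentage(text: str, ignore_chars: str = _DEFAULT_IGNORE) -> float:
    # Keep only the characters that take part in the ratio.
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        return 0.0
    # Assumption: "Thai" means the Unicode Thai block.
    thai = sum(1 for ch in counted if "\u0e00" <= ch <= "\u0e7f")
    return 100.0 * thai / len(counted)


print(thai_percentage("ใช้งาน PyThaiNLP 3.0"))
print(thai_percentage("ใช้งาน PyThaiNLP 3.0", ignore_chars=""))
```

With the default ignore list, 6 of the 15 counted characters are Thai; with an empty ignore list, 6 of 20 are.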
pythainlp.util.dict_trie(dict_source: str | Iterable[str] | Trie) Trie[source]

Create a dictionary trie from a file or an iterable.

Parameters:

dict_source (str|Iterable[str]|pythainlp.util.Trie) – a path to dictionary file or a list of words or a pythainlp.util.Trie object

Returns:

a trie object

Return type:

pythainlp.util.Trie

pythainlp.util.digit_to_text(text: str) str[source]
Parameters:

text (str) – Text with digits such as ‘1’, ‘2’, ‘๓’, ‘๔’

Returns:

Text with digits being spelled out in Thai
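
This entry has no example, so here is a small standard-library sketch of what spelling out digits could look like. It assumes that every digit, Arabic or Thai, is replaced individually by its Thai word; that behaviour and the helper name spell_digits are assumptions, not the library's documented implementation.

```python
# Assumption: each digit is spelled out on its own, one word per digit.
_DIGIT_WORDS = ["ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่",
                "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]

_SPELL = {}
for value, word in enumerate(_DIGIT_WORDS):
    _SPELL[str(value)] = word           # Arabic digit
    _SPELL[chr(0x0E50 + value)] = word  # Thai digit (U+0E50 is Thai zero)


def spell_digits(text: str) -> str:
    """Hypothetical helper: replace each digit with its Thai word."""
    return "".join(_SPELL.get(ch, ch) for ch in text)


print(spell_digits("1"))
print(spell_digits("๓"))
```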

pythainlp.util.display_thai_char(ch: str) str[source]

Prefix an underscore (_) to a high-position vowel or a tone mark, to ease readability.

Parameters:

ch (str) – input character

Returns:

“_” + ch

Return type:

str

Example:

from pythainlp.util import display_thai_char

display_thai_char("้")
# output: "_้"
pythainlp.util.emoji_to_thai(text: str, delimiters=(':', ':')) str[source]

This function converts emoji to their Thai meaning.

Parameters:

text (str) – Text with Emoji

Returns:

Text with emoji converted to their Thai meaning

Return type:

str

Example:

from pythainlp.util import emoji_to_thai

emoji_to_thai("จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี 😀")
# output: จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี :หน้ายิ้มยิงฟัน:

emoji_to_thai("หิวข้าวอยากกินอาหารญี่ปุ่น 🍣")
# output: หิวข้าวอยากกินอาหารญี่ปุ่น :ซูชิ:

emoji_to_thai("🇹🇭 นี่คือธงประเทศไทย")
# output: :ธง_ไทย: นี่คือธงประเทศไทย
pythainlp.util.eng_to_thai(text: str) str[source]

Corrects the given text that was incorrectly typed using English-US Qwerty keyboard layout to the originally intended keyboard layout that is the Thai Kedmanee keyboard.

Parameters:

text (str) – incorrect text input (type Thai with English keyboard)

Returns:

Thai text where incorrect typing with a keyboard layout is corrected

Return type:

str

Example:

Intentionally type “ธนาคารแห่งประเทศไทย”, but got “Tok8kicsj’xitgmLwmp”:

from pythainlp.util import eng_to_thai

eng_to_thai("Tok8kicsj'xitgmLwmp")
# output: ธนาคารแห่งประเทศไทย
pythainlp.util.find_keyword(word_list: List[str], min_len: int = 3) Dict[str, int][source]

This function counts the frequency of each word in the list, excluding stopwords, and returns the counts as a frequency dictionary.

Parameters:
  • word_list (list) – a list of words

  • min_len (int) – the minimum frequency a word must have to be included

Returns:

a dictionary object with key-value pair as word and its raw count

Return type:

dict[str, int]

Example:

from pythainlp.util import find_keyword

words = ["บันทึก", "เหตุการณ์", "บันทึก", "เหตุการณ์",
         " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร"
         "และ", "การ", "บันทึก","เสียง","ใน","เหตุการณ์"]

find_keyword(words)
# output: {'บันทึก': 4, 'เหตุการณ์': 3}

find_keyword(words, min_len=1)
# output: {' ': 2, 'บันทึก': 4, 'ลายลักษณ์อักษรและ': 1,
#          'เสียง': 1, 'เหตุการณ์': 3}
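
The counting-and-filtering behaviour described above can be approximated with collections.Counter. This is only a sketch: the helper name keyword_counts, the explicit stopwords argument, and the English sample words are all hypothetical stand-ins for the library's own stopword list and tokens.

```python
from collections import Counter
from typing import Dict, Iterable, List


def keyword_counts(
    words: List[str], stopwords: Iterable[str], min_count: int = 3
) -> Dict[str, int]:
    stop = set(stopwords)
    # Count every word that is not a stopword.
    counts = Counter(w for w in words if w not in stop)
    # Keep only the words that occur at least min_count times.
    return {w: n for w, n in counts.items() if n >= min_count}


sample = ["log", "event", "log", "the", "log", "event", "the", "the"]
print(keyword_counts(sample, stopwords={"the"}))
print(keyword_counts(sample, stopwords={"the"}, min_count=2))
```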
pythainlp.util.ipa_to_rtgs(ipa: str) str[source]

Convert IPA to the Royal Thai General System of Transcription (RTGS).

Docs: https://en.wikipedia.org/wiki/Help:IPA/Thai

Parameters:

ipa (str) – IPA phoneme

Returns:

The converted RTGS text

Return type:

str

Example:

from pythainlp.util import ipa_to_rtgs

print(ipa_to_rtgs("kluaj"))
# output : 'kluai'
pythainlp.util.is_native_thai(word: str) bool[source]

Check if a word is a “native Thai word” (Thai: “คำไทยแท้”). This function is based on a simple heuristic algorithm and is not entirely reliable.

Parameters:

word (str) – word

Returns:

True or False

Return type:

bool

Example:

English word:

from pythainlp.util import is_native_thai

is_native_thai("Avocado")
# output: False

Native Thai word:

is_native_thai("มะม่วง")
# output: True
is_native_thai("ตะวัน")
# output: True

Non-native Thai word:

is_native_thai("สามารถ")
# output: False
is_native_thai("อิสริยาภรณ์")
# output: False
pythainlp.util.isthai(text: str, ignore_chars: str = '.') bool[source]

Check if every character in a string is a Thai character.

Parameters:
  • text (str) – input text

  • ignore_chars (str, optional) – characters to be ignored, defaults to “.”

Returns:

True if every character in the input string is Thai, otherwise False.

Return type:

bool

Example:

from pythainlp.util import isthai

isthai("กาลเวลา")
# output: True

isthai("กาลเวลา.")
# output: True

isthai("กาล-เวลา")
# output: False

isthai("กาล-เวลา +66", ignore_chars="01234567890+-.,")
# output: True
pythainlp.util.isthaichar(ch: str) bool[source]

Check if a character is a Thai character.

Parameters:

ch (str) – input character

Returns:

True if ch is a Thai character, otherwise False.

Return type:

bool

Example:

from pythainlp.util import isthaichar

isthaichar("ก")  # THAI CHARACTER KO KAI
# output: True

isthaichar("๕")  # THAI DIGIT FIVE
# output: True
pythainlp.util.maiyamok(sent: str | List[str]) List[str][source]

Thai MaiYaMok

MaiYaMok (ๆ) is the mark that indicates a repeated word in Thai. This function preprocesses MaiYaMok in a Thai sentence by expanding each mark into the word it repeats.

Parameters:

sent (Union[str, List[str]]) – input sentence (list or str)

Returns:

List of words

Return type:

List[str]

Example:

from pythainlp.util import maiyamok

maiyamok("เด็กๆชอบไปโรงเรียน")
# output: ['เด็ก', 'เด็ก', 'ชอบ', 'ไป', 'โรงเรียน']

maiyamok(["ทำไม","คน","ดี"," ","ๆ","ๆ"," ","ถึง","ทำ","ไม่ได้"])
# output: ['ทำไม', 'คน', 'ดี', 'ดี', 'ดี', ' ', 'ถึง', 'ทำ', 'ไม่ได้']
pythainlp.util.nectec_to_ipa(pronunciation: str) str[source]

Convert the NECTEC system to IPA.

Parameters:

pronunciation (str) – NECTEC phoneme

Returns:

The converted IPA text

Return type:

str

Example:

from pythainlp.util import nectec_to_ipa

print(nectec_to_ipa("kl-uua-j^-2"))
# output : 'kl uua j ˥˩'

References

Pornpimon Palingoon, Sumonmas Thatphithakkul. Chapter 4 Speech processing and Speech corpus. In: Handbook of Thai Electronic Corpus. 1st ed. p. 122–56.

pythainlp.util.normalize(text: str) str[source]

Normalize and clean Thai text with normalizing rules as follows:

  • Remove zero-width spaces

  • Remove duplicate spaces

  • Reorder tone marks and vowels to standard order/spelling

  • Remove duplicate vowels and signs

  • Remove duplicate tone marks

  • Remove dangling non-base characters at the beginning of text

normalize() simply calls remove_zw(), remove_dup_spaces(), remove_repeat_vowels(), and remove_dangling(), in that order.

If a user wants to customize the selection or the order of rules to be applied, they can choose to call those functions by themselves.

Note: for Unicode normalization, see unicodedata.normalize().

Parameters:

text (str) – input text

Returns:

normalized text according to the rules

Return type:

str

Example:

from pythainlp.util import normalize

normalize('เเปลก')  # starts with two Sara E
# output: แปลก

normalize('นานาาา')
# output: นานา
pythainlp.util.now_reign_year() int[source]

Return the reign year of the 10th King of Chakri dynasty.

Returns:

reign year of the 10th King of Chakri dynasty.

Return type:

int

Example:

from pythainlp.util import now_reign_year

text = "เป็นปีที่ {reign_year} ในรัชกาลปัจจุบัน"\
    .format(reign_year=now_reign_year())

print(text)
# output: เป็นปีที่ 4 ในรัชกาลปัจจุบัน
pythainlp.util.num_to_thaiword(number: int) str[source]

This function converts a number to Thai text.

Parameters:

number (int) – an integer number to be converted to Thai text

Returns:

text representing the number in Thai

Return type:

str

Example:

from pythainlp.util import num_to_thaiword

num_to_thaiword(1)
# output: หนึ่ง

num_to_thaiword(11)
# output: สิบเอ็ด
pythainlp.util.rank(words: List[str], exclude_stopwords: bool = False) Counter[source]

Count word frequency given a list of Thai words, with an option to exclude stopwords.

Parameters:
  • words (list) – a list of words

  • exclude_stopwords (bool) – If set to True, stopwords are excluded from counting. Otherwise, stopwords are counted. Defaults to False.

Returns:

a Counter object representing word frequency from the text

Return type:

collections.Counter

Example:

Include stopwords in counting word frequency:

from pythainlp.util import rank

words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", \
"เป็น", " ", "ลายลักษณ์อักษร"]

rank(words)
# output:
# Counter(
#     {
#         ' ': 2,
#         'การ': 1,
#         'บันทึก': 2,
#         'มี': 1,
#         'ลายลักษณ์อักษร': 1,
#         'เป็น': 1,
#         'เหตุการณ์': 1
#     })

Exclude stopwords when counting word frequency:

from pythainlp.util import rank

words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", \
    "เป็น", " ", "ลายลักษณ์อักษร"]

rank(words, exclude_stopwords=True)
# output:
# Counter(
#     {
#         ' ': 2,
#         'บันทึก': 2,
#         'ลายลักษณ์อักษร': 1,
#         'เหตุการณ์': 1
#     })
pythainlp.util.reign_year_to_ad(reign_year: int, reign: int) int[source]

Convert reign year to AD.

Return AD year according to the reign year for the 7th to 10th King of Chakri dynasty, Thailand. For instance, the AD year of the 4th reign year of the 10th King is 2019.

Parameters:
  • reign_year (int) – reign year of the King

  • reign (int) – the reign of the King (i.e. 7, 8, 9, and 10)

Returns:

the year in AD of the King given the reign and reign year.

Return type:

int

Example:

from pythainlp.util import reign_year_to_ad

print("The 4th reign year of the King Rama X is in", \
    reign_year_to_ad(4, 10))
# output: The 4th reign year of the King Rama X is in 2019

print("The 1st reign year of the King Rama IX is in", \
    reign_year_to_ad(1, 9))
# output: The 1st reign year of the King Rama IX is in 1946
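
The arithmetic behind this conversion is a lookup plus an offset. The sketch below is hedged accordingly: the start years are inferred only from the two examples above (so only reigns 9 and 10 are covered), and the names are hypothetical rather than the library's internals.

```python
# First AD year of each reign, inferred from the examples above.
# Reigns 7 and 8 are deliberately left out of this sketch.
_REIGN_START_AD = {9: 1946, 10: 2016}


def reign_year_to_ad_sketch(reign_year: int, reign: int) -> int:
    # The 1st reign year falls in the start year itself, hence the -1.
    return _REIGN_START_AD[reign] + reign_year - 1


print(reign_year_to_ad_sketch(4, 10))
print(reign_year_to_ad_sketch(1, 9))
```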
pythainlp.util.remove_dangling(text: str) str[source]

Remove Thai non-base characters at the beginning of text.

This is a common “typo”, especially for input fields in a form, as these non-base characters can be visually hidden from users who may have accidentally typed them in.

A character to be removed should be both:

  • tone mark, above vowel, below vowel, or non-base sign AND

  • located at the beginning of the text

Parameters:

text (str) – input text

Returns:

text without dangling Thai characters at the beginning

Return type:

str

Example:

from pythainlp.util import remove_dangling

remove_dangling('๊ก')
# output: 'ก'
pythainlp.util.remove_dup_spaces(text: str) str[source]

Remove duplicate spaces. Replace multiple spaces with one space.

Multiple newline characters and empty lines will be replaced with one newline character.

Parameters:

text (str) – input text

Returns:

text without duplicated spaces and newlines

Return type:

str

Example:

from pythainlp.util import remove_dup_spaces

remove_dup_spaces('ก    ข    ค')
# output: 'ก ข ค'
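
The two rules above (collapse runs of spaces, collapse runs of newlines and empty lines) map naturally onto two regular-expression substitutions. This is a sketch of the idea with a hypothetical helper name; the library may implement it differently.

```python
import re


def squeeze_spaces(text: str) -> str:
    # Two or more consecutive spaces become a single space.
    text = re.sub(r" {2,}", " ", text)
    # A newline with any surrounding spaces/newlines becomes one newline.
    text = re.sub(r"[ \n]*\n[ \n]*", "\n", text)
    return text


print(squeeze_spaces("ก    ข    ค"))
print(repr(squeeze_spaces("a\n\n\nb")))
```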
pythainlp.util.remove_repeat_vowels(text: str) str[source]

Remove repeating vowels, tone marks, and signs.

This function will call reorder_vowels() first, to make sure that double Sara E will be converted to Sara Ae and not be removed.

Parameters:

text (str) – input text

Returns:

text without repeating Thai vowels, tone marks, and signs

Return type:

str

pythainlp.util.remove_tone_ipa(ipa: str) str[source]

Remove Thai tones from IPA text.

Parameters:

ipa (str) – IPA phoneme

Returns:

IPA phoneme with the tones removed

Return type:

str

Example:

from pythainlp.util import remove_tone_ipa

print(remove_tone_ipa("laː˦˥.sa˨˩.maj˩˩˦"))
# output : laː.sa.maj
pythainlp.util.remove_tonemark(text: str) str[source]

Remove all Thai tone marks from the text.

Thai script has four tone marks indicating four tones as follows:

  • Down tone (Thai: ไม้เอก _่ )

  • Falling tone (Thai: ไม้โท _้ )

  • High tone (Thai: ไม้ตรี ​_๊ )

  • Rising tone (Thai: ไม้จัตวา _๋ )

Using the wrong tone mark is a common mistake in Thai writing. Removing tone marks from a string makes it usable for approximate string matching.

Parameters:

text (str) – input text

Returns:

text without Thai tone marks

Return type:

str

Example:

from pythainlp.util import remove_tonemark

remove_tonemark('สองพันหนึ่งร้อยสี่สิบเจ็ดล้านสี่แสนแปดหมื่นสามพันหกร้อยสี่สิบเจ็ด')
# output: สองพันหนึงรอยสีสิบเจ็ดลานสีแสนแปดหมืนสามพันหกรอยสีสิบเจ็ด
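
To show how tone-mark removal supports approximate matching, here is a standard-library sketch. It assumes the four tone marks are the code points U+0E48 to U+0E4B; the helper name strip_tonemarks is hypothetical and this is not the library's implementation.

```python
# The four Thai tone marks: Mai Ek, Mai Tho, Mai Tri, Mai Chattawa.
_TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"
# A translate table that maps each tone mark to None deletes it.
_DROP_TONES = dict.fromkeys(map(ord, _TONE_MARKS))


def strip_tonemarks(text: str) -> str:
    return text.translate(_DROP_TONES)


# Two spellings that differ only in the tone mark compare equal afterwards.
print(strip_tonemarks("ร\u0e49อย") == strip_tonemarks("ร\u0e48อย"))
```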
pythainlp.util.remove_zw(text: str) str[source]

Remove zero-width characters.

These non-visible characters may cause unexpected result from the user’s point of view. Removing them can make string matching more robust.

Characters to be removed:

  • Zero-width space (ZWSP)

  • Zero-width non-joiner (ZWNJ)

Parameters:

text (str) – input text

Returns:

text without zero-width characters

Return type:

str
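
A minimal sketch of the removal, assuming the two characters listed above correspond to U+200B (ZWSP) and U+200C (ZWNJ). The helper name is hypothetical and the sketch does not claim to mirror the library's code.

```python
# Code points to delete: zero-width space and zero-width non-joiner.
_ZERO_WIDTH = {0x200B: None, 0x200C: None}


def strip_zero_width(text: str) -> str:
    return text.translate(_ZERO_WIDTH)


dirty = "ก\u200bข\u200cค"
print(len(dirty), len(strip_zero_width(dirty)))
print(strip_zero_width(dirty) == "กขค")
```

After stripping, strings that look identical on screen also compare equal, which is the robustness benefit mentioned above.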

pythainlp.util.reorder_vowels(text: str) str[source]

Reorder vowels and tone marks to the standard logical order/spelling.

Characters in input text will be reordered/transformed, according to these rules:

  • Sara E + Sara E -> Sara Ae

  • Nikhahit + Sara Aa -> Sara Am

  • tone mark + non-base vowel -> non-base vowel + tone mark

  • follow vowel + tone mark -> tone mark + follow vowel

Parameters:

text (str) – input text

Returns:

text with vowels and tone marks in the standard logical order

Return type:

str
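
The first two rules above are plain substring replacements and can be sketched as follows. Only those two rules are covered; the reordering rules for tone marks are left out, and the helper name merge_vowels is hypothetical.

```python
SARA_E, SARA_AE = "\u0e40", "\u0e41"
NIKHAHIT, SARA_AA, SARA_AM = "\u0e4d", "\u0e32", "\u0e33"


def merge_vowels(text: str) -> str:
    # Sara E + Sara E -> Sara Ae
    text = text.replace(SARA_E + SARA_E, SARA_AE)
    # Nikhahit + Sara Aa -> Sara Am
    text = text.replace(NIKHAHIT + SARA_AA, SARA_AM)
    return text


print(merge_vowels(SARA_E + SARA_E + "ปลก"))
```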

pythainlp.util.sound_syllable(syllable: str) str[source]

Sound syllable classification

This function classifies a syllable as a live syllable or a dead syllable.

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s type (live or dead)

Return type:

str

Example:

from pythainlp.util import sound_syllable

print(sound_syllable("มา"))
# output: live

print(sound_syllable("เลข"))
# output: dead
pythainlp.util.syllable_length(syllable: str) str[source]

Thai syllable length

This function finds a syllable’s length (long or short).

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s length (long or short)

Return type:

str

Example:

from pythainlp.util import syllable_length

print(syllable_length("มาก"))
# output: long

print(syllable_length("คะ"))
# output: short
pythainlp.util.syllable_open_close_detector(syllable: str) str[source]

Thai syllable open/close detector

This function detects whether a Thai syllable has an open or a closed sound.

Parameters:

syllable (str) – Thai syllable

Returns:

open / close

Return type:

str

Example:

from pythainlp.util import syllable_open_close_detector

print(syllable_open_close_detector("มาก"))
# output: close

print(syllable_open_close_detector("คะ"))
# output: open
pythainlp.util.text_to_arabic_digit(text: str) str[source]

This function converts a Thai spelled-out digit to an Arabic digit.

Parameters:

text (str) – A digit spelled out in Thai

Returns:

An Arabic digit such as ‘1’, ‘2’, ‘3’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type:

str

Example:

from pythainlp.util import text_to_arabic_digit

text_to_arabic_digit("ศูนย์")
# output: 0
text_to_arabic_digit("หนึ่ง")
# output: 1
text_to_arabic_digit("แปด")
# output: 8
text_to_arabic_digit("เก้า")
# output: 9

# For text that is not Thai digit spelled out
text_to_arabic_digit("สิบ") == ""
# output: True
text_to_arabic_digit("เก้าร้อย") == ""
# output: True
pythainlp.util.text_to_num(text: str) List[str][source]

Convert Thai text into a list of Thai words, with spelled-out numerals converted to numbers.

Parameters:

text (str) – Thai text with the spelled-out numerals

Returns:

list of Thai words, with the spelled-out numerals replaced by their numeric values

Return type:

List[str]

Example:

from pythainlp.util import text_to_num

text_to_num("เก้าร้อยแปดสิบจุดเก้าห้าบาทนี่คือจำนวนทั้งหมด")
# output: ['980.95', 'บาท', 'นี่', 'คือ', 'จำนวน', 'ทั้งหมด']

text_to_num("สิบล้านสองหมื่นหนึ่งพันแปดร้อยแปดสิบเก้าบาท")
# output: ['10021889', 'บาท']
pythainlp.util.text_to_thai_digit(text: str) str[source]

This function converts a Thai spelled-out digit to a Thai digit.

Parameters:

text (str) – A digit spelled out in Thai

Returns:

A Thai digit such as ‘๑’, ‘๒’, ‘๓’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type:

str

Example:

from pythainlp.util import text_to_thai_digit

text_to_thai_digit("ศูนย์")
# output: ๐
text_to_thai_digit("หนึ่ง")
# output: ๑
text_to_thai_digit("แปด")
# output: ๘
text_to_thai_digit("เก้า")
# output: ๙

# For text that is not Thai digit spelled out
text_to_thai_digit("สิบ") == ""
# output: True
text_to_thai_digit("เก้าร้อย") == ""
# output: True
pythainlp.util.thai_digit_to_arabic_digit(text: str) str[source]

This function converts Thai digits (e.g. ๑, ๓, ๑๐) to Arabic digits (e.g. 1, 3, 10).

Parameters:

text (str) – Text with Thai digits such as ‘๑’, ‘๒’, ‘๓’

Returns:

Text with Thai digits being converted to Arabic digits such as ‘1’, ‘2’, ‘3’

Return type:

str

Example:

from pythainlp.util import thai_digit_to_arabic_digit

text = 'เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท'

thai_digit_to_arabic_digit(text)
# output: เป็นจำนวน 123,400.25 บาท
pythainlp.util.thai_strftime(dt_obj: datetime, fmt: str = '%-d %b %y', thaidigit: bool = False) str[source]

Convert datetime.datetime into Thai date and time format.

The formatting directives are similar to datetime.strftime().

This function uses Thai names and Thai Buddhist Era for these directives:
  • %a - abbreviated weekday name (e.g. “จ”, “อ”, “พ”, “พฤ”, “ศ”, “ส”, “อา”)

  • %A - full weekday name (e.g. “วันจันทร์”, “วันอังคาร”, “วันเสาร์”, “วันอาทิตย์”)

  • %b - abbreviated month name (e.g. “ม.ค.”, “ก.พ.”, “มี.ค.”, “เม.ย.”, “พ.ค.”, “มิ.ย.”, “ธ.ค.”)

  • %B - full month name (e.g. “มกราคม”, “กุมภาพันธ์”, “พฤศจิกายน”, “ธันวาคม”)

  • %y - year without century (e.g. “56”, “10”)

  • %Y - year with century (e.g. “2556”, “2410”)

  • %c - date and time representation (e.g. “พ 6 ต.ค. 01:40:00 2519”)

  • %v - short date representation (e.g. “ 6-ม.ค.-2562”, “27-ก.พ.-2555”)

Other directives will be passed to datetime.strftime()

Note:
  • The Thai Buddhist Era (BE) year is simply converted from AD by adding 543. This is certainly not accurate for years before 1941 AD, due to the change in Thai New Year’s Day.

  • This is meant to be an interim solution, since the locale module of the Python standard library (which relies on C’s strftime()) does not support the “th” or “th_TH” locale yet. If it were supported, we could just call locale.setlocale(locale.LC_TIME, "th_TH") and then use the native datetime.strftime().

We are trying to make this platform-independent and to support as many extensions as possible, including the strftime() extensions in POSIX, BSD, and GNU libc.

Parameters:
  • dt_obj (datetime) – an instantiated object of datetime.datetime

  • fmt (str) – string containing date and time directives

  • thaidigit (bool) – If thaidigit is set to False (default), number will be represented in Arabic digit. If it is set to True, it will be represented in Thai digit.

Returns:

Date and time text, with month in Thai name and year in Thai Buddhist era. The year is simply converted from AD by adding 543 (not accurate for years before 1941 AD, due to the change in Thai New Year’s Day).

Return type:

str

Example:

from datetime import datetime
from pythainlp.util import thai_strftime

datetime_obj = datetime(year=2019, month=6, day=9, \
    hour=5, minute=59, second=0, microsecond=0)

print(datetime_obj)
# output: 2019-06-09 05:59:00

thai_strftime(datetime_obj, "%A %d %B %Y")
# output: 'วันอาทิตย์ 09 มิถุนายน 2562'

thai_strftime(datetime_obj, "%a %-d %b %y")  # no padding
# output: 'อา 9 มิ.ย. 62'

thai_strftime(datetime_obj, "%a %_d %b %y")  # space padding
# output: 'อา  9 มิ.ย. 62'

thai_strftime(datetime_obj, "%a %0d %b %y")  # zero padding
# output: 'อา 09 มิ.ย. 62'

thai_strftime(datetime_obj, "%-H นาฬิกา %-M นาที", thaidigit=True)
# output: '๕ นาฬิกา ๕๙ นาที'

thai_strftime(datetime_obj, "%D (%v)")
# output: '06/09/62 ( 9-มิ.ย.-2562)'

thai_strftime(datetime_obj, "%c")
# output: 'อา  9 มิ.ย. 05:59:00 2562'

thai_strftime(datetime_obj, "%H:%M %p")
# output: '05:59 AM'

thai_strftime(datetime_obj, "%H:%M %#p")
# output: '05:59 am'
pythainlp.util.thai_strptime(text: str, fmt: str, year: str = 'be', add_year: int | None = None, tzinfo=backports.zoneinfo.ZoneInfo(key='Asia/Bangkok'))[source]

Parse a Thai date and time string into a datetime.datetime object.

Parameters:
  • text (str) – text

  • fmt (str) – string containing date and time directives

  • year (str) – era of the year in the text (ad is Anno Domini and be is Buddhist calendar)

  • add_year (int) – number of years to add when converting to AD

  • tzinfo (object) – tzinfo (default is Asia/Bangkok)

Returns:

The parsed date and time as a datetime.datetime object

Return type:

datetime.datetime

Supported fmt directives:
  • %d - Day (1 - 31)

  • %B - Thai month (03, 3, มี.ค., or มีนาคม)

  • %Y - Year (66, 2566, or 2023)

  • %H - Hour (0 - 23)

  • %M - Minute (0 - 59)

  • %S - Second (0 - 59)

  • %f - Microsecond

Example:

from pythainlp.util import thai_strptime

thai_strptime("15 ก.ค. 2565 09:00:01","%d %B %Y %H:%M:%S")
# output:
# datetime.datetime(
#   2022,
#   7,
#   15,
#   9,
#   0,
#   1,
#   tzinfo=backports.zoneinfo.ZoneInfo(key='Asia/Bangkok')
# )
pythainlp.util.thai_to_eng(text: str) str[source]

Corrects the given text that was incorrectly typed using Thai Kedmanee keyboard layout to the originally intended keyboard layout that is the English-US Qwerty keyboard.

Parameters:

text (str) – incorrect text input (type English with Thai keyboard)

Returns:

English text where incorrect typing with a keyboard layout is corrected

Return type:

str

Example:

Intentionally type “Bank of Thailand”, but got “ฺฟืา นด ธ้ฟรสฟืก”:

from pythainlp.util import thai_to_eng

thai_to_eng("ฺฟืา นด ธ้ฟรสฟืก")
# output: 'Bank of Thailand'
pythainlp.util.thai_word_tone_detector(word: str) Tuple[str, str][source]

Thai tone detector for word.

It uses pythainlp.transliterate.pronunciate to convert the word to its pronunciation.

Parameters:

word (str) – Thai word.

Returns:

Thai pronunciation with the tone of each syllable (l, m, h, r, f, or empty if the tone cannot be detected)

Return type:

Tuple[str, str]

Example:

from pythainlp.util import thai_word_tone_detector

print(thai_word_tone_detector("คนดี"))
# output: [('คน', 'm'), ('ดี', 'm')]

print(thai_word_tone_detector("มือถือ"))
# output: [('มือ', 'm'), ('ถือ', 'r')]
pythainlp.util.thaiword_to_date(text: str, date: datetime | None = None) datetime | None[source]

Convert Thai relative date to datetime.datetime.

Parameters:
  • text (str) – Thai text containing a relative date

  • date (datetime.datetime) – date (default is datetime.datetime.now())

Returns:

datetime object, if it can be calculated. Otherwise, None.

Return type:

datetime.datetime

Example:

from pythainlp.util import thaiword_to_date

thaiword_to_date("พรุ่งนี้")
# output: datetime of tomorrow

pythainlp.util.thaiword_to_num(word: str) int[source]

Converts the spelled-out numerals in Thai scripts into an actual integer.

Parameters:

word (str) – Spelled-out numerals in Thai scripts

Returns:

Corresponding integer value of the input

Return type:

int

Example:

from pythainlp.util import thaiword_to_num

thaiword_to_num("ศูนย์")
# output: 0

thaiword_to_num("สองล้านสามแสนหกร้อยสิบสอง")
# output: 2300612
pythainlp.util.thaiword_to_time(text: str, padding: bool = True) str[source]

Convert Thai time in words into time (H:M).

Parameters:
  • text (str) – Thai time in words

  • padding (bool) – Zero padding the hour if True

Returns:

time string

Return type:

str

Example:

from pythainlp.util import thaiword_to_time

thaiword_to_time("บ่ายโมงครึ่ง")
# output:
# 13:30
pythainlp.util.time_to_thaiword(time_data: time | datetime | str, fmt: str = '24h', precision: str | None = None) str[source]

Spell out time to Thai words.

Parameters:
  • time_data (str) – time input, can be a datetime.time object or a datetime.datetime object or a string (in H:M or H:M:S format, using 24-hour clock)

  • fmt (str) – time output format

      • 24h - 24-hour clock (default)
      • 6h - 6-hour clock
      • m6h - Modified 6-hour clock

  • precision (str) – precision of the spelled-out time

      • m - always spell out to minute level
      • s - always spell out to second level
      • None - spell out only non-zero parts

Returns:

Time spelled out in Thai words

Return type:

str

Example:

import datetime
from pythainlp.util import time_to_thaiword

time_to_thaiword("8:17")
# output:
# แปดนาฬิกาสิบเจ็ดนาที

time_to_thaiword("8:17", "6h")
# output:
# สองโมงเช้าสิบเจ็ดนาที

time_to_thaiword("8:17", "m6h")
# output:
# แปดโมงสิบเจ็ดนาที

time_to_thaiword("18:30", fmt="m6h")
# output:
# หกโมงครึ่ง

time_to_thaiword(datetime.time(12, 3, 0))
# output:
# สิบสองนาฬิกาสามนาที

time_to_thaiword(datetime.time(12, 3, 0), precision="s")
# output:
# สิบสองนาฬิกาสามนาทีศูนย์วินาที
pythainlp.util.tis620_to_utf8(text: str) str[source]

Convert TIS-620 to UTF-8

Parameters:

text (str) – Text that use TIS-620 encoding

Returns:

Text that use UTF-8 encoding

Return type:

str

Example:

from pythainlp.util import tis620_to_utf8

tis620_to_utf8("¡ÃÐ·ÃÇ§ÍØµÊÒË¡ÃÃÁ")
# output: 'กระทรวงอุตสาหกรรม'
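
The garbled input in the example is what TIS-620 bytes look like when they are decoded with a Western single-byte codec. The sketch below reproduces that round trip with Python's built-in codecs. It assumes the wrong codec was Windows-1252 (cp1252); the helper name is hypothetical, and the library's own approach may differ.

```python
def fix_tis620_mojibake(text: str) -> str:
    # Recover the raw bytes, then decode them with the intended codec.
    return text.encode("cp1252").decode("tis-620")


original = "กระทรวงอุตสาหกรรม"
# Simulate the damage: TIS-620 bytes read as if they were cp1252.
garbled = original.encode("tis-620").decode("cp1252")

print(garbled)
print(fix_tis620_mojibake(garbled) == original)
```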
pythainlp.util.tone_detector(syllable: str) str[source]

Thai tone detector for syllables

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s tone (l, m, h, r, f, or empty if the tone cannot be detected)

Return type:

str

Example:

from pythainlp.util import tone_detector

print(tone_detector("มา"))
# output: m

print(tone_detector("ไม้"))
# output: h
pythainlp.util.words_to_num(words: list) float[source]

Convert a list of Thai words to a float.

Parameters:

words (list) – list of Thai words

Returns:

float value of the words

Return type:

float

Example:

from pythainlp.util import words_to_num

words_to_num(["ห้า", "สิบ", "จุด", "เก้า", "ห้า"])
# output: 50.95
pythainlp.util.spell_words.spell_syllable(s: str) List[str][source]

Spell out a syllable in the Thai word distribution form.

Parameters:

s (str) – Thai syllable only

Returns:

List of the spelled-out parts of the syllable

Return type:

List[str]

Example:

from pythainlp.util.spell_words import spell_syllable

print(spell_syllable("แมว"))
# output: ['มอ', 'วอ', 'แอ', 'แมว']
pythainlp.util.spell_words.spell_word(w: str) List[str][source]

Spell out a word in the Thai word distribution form.

Parameters:

w (str) – Thai word only

Returns:

List of the spelled-out parts of the word

Return type:

List[str]

Example:

from pythainlp.util.spell_words import spell_word

print(spell_word("คนดี"))
# output: ['คอ', 'นอ', 'คน', 'ดอ', 'อี', 'ดี', 'คนดี']
class pythainlp.util.Trie(words: Iterable[str])[source]
class Node[source]
__init__()[source]
end
children
__init__(words: Iterable[str])[source]
add(word: str) None[source]

Add a word to the trie. Spaces in front of and following the word will be removed.

Parameters:

word (str) – a word

remove(word: str) None[source]

Remove a word from the trie. If the word is not found, do nothing.

Parameters:

word (str) – a word

prefixes(text: str) List[str][source]

List all possible words from first sequence of characters in a word.

Parameters:

text (str) – a word

Returns:

a list of possible words

Return type:

List[str]
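
To make the add() and prefixes() behaviour concrete, here is a minimal, self-contained trie sketch. MiniTrie is a hypothetical stand-in that mirrors the member names listed above (Node, end, children); it is not the library's implementation, and it assumes prefixes() returns every stored word that is a prefix of the given text.

```python
from typing import Iterable, List


class MiniTrie:
    class Node:
        def __init__(self) -> None:
            self.end = False    # True if a stored word ends at this node
            self.children = {}  # maps a character to the next Node

    def __init__(self, words: Iterable[str]) -> None:
        self.root = MiniTrie.Node()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        # Leading and trailing spaces are removed before insertion.
        node = self.root
        for ch in word.strip():
            node = node.children.setdefault(ch, MiniTrie.Node())
        node.end = True

    def prefixes(self, text: str) -> List[str]:
        # Walk the text one character at a time and collect each
        # position where a stored word ends.
        found = []
        node = self.root
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.end:
                found.append(text[: i + 1])
        return found


trie = MiniTrie(["ก", "กา", "กาล", "ข"])
print(trie.prefixes("กาลเวลา"))
```

Looking up all dictionary words that start at a given position is the typical use of such a structure in dictionary-based word segmentation.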