pythainlp.corpus

The pythainlp.corpus module provides access to various Thai language corpora and resources that come bundled with PyThaiNLP. These resources are essential for natural language processing tasks in the Thai language.

Modules

countries

pythainlp.corpus.countries() FrozenSet[str][source]

Return a frozenset of country names in Thai such as “แคนาดา”, “โรมาเนีย”, “แอลจีเรีย”, and “ลาว”.

(See: dev/pythainlp/corpus/countries_th.txt)

Returns:

frozenset containing country names in Thai

Return type:

frozenset

find_synonym

get_corpus

pythainlp.corpus.get_corpus(filename: str, comments: bool = True) frozenset[source]

Read corpus data from file and return a frozenset.

Each line in the file will be a member of the set.

Whitespace is stripped, and empty values and duplicates are removed.

If comments is False, any text at any position after the character ‘#’ in each line will be discarded.

Parameters:
  • filename (str) – filename of the corpus to be read

  • comments (bool) – keep comments

Returns:

frozenset consisting of lines in the file

Return type:

frozenset

Example:

from pythainlp.corpus import get_corpus

# input file (negations_th.txt):
# แต่
# ไม่

get_corpus("negations_th.txt")
# output:
# frozenset({'แต่', 'ไม่'})

# input file (ttc_freq.txt):
# ตัวบท<tab>10
# โดยนัยนี้<tab>1

get_corpus("ttc_freq.txt")
# output:
# frozenset({'โดยนัยนี้\t1',
#    'ตัวบท\t10',
#     ...})

# input file (icubrk_th.txt):
# # Thai Dictionary for ICU BreakIterator
# กก
# กกขนาก

get_corpus("icubrk_th.txt")
# output:
# frozenset({'กกขนาก',
#     '# Thai Dictionary for ICU BreakIterator',
#     'กก',
#     ...})

get_corpus("icubrk_th.txt", comments=False)
# output:
# frozenset({'กกขนาก',
#     'กก',
#     ...})
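
The rules described above (strip whitespace, drop empty values and duplicates, optionally discard text after '#') can be sketched in plain Python. This is only an illustration, not PyThaiNLP's implementation; normalize_lines is a hypothetical helper that works on lines already read from a file.

```python
from typing import FrozenSet, Iterable


def normalize_lines(lines: Iterable[str], comments: bool = True) -> FrozenSet[str]:
    """Hypothetical helper mirroring the documented get_corpus() rules."""
    cleaned = []
    for line in lines:
        if not comments:
            # Discard everything after the first '#' on the line.
            line = line.split("#", 1)[0]
        line = line.strip()
        if line:  # skip empty values
            cleaned.append(line)
    return frozenset(cleaned)  # a frozenset drops duplicates


sample = ["# Thai Dictionary for ICU BreakIterator\n", "กก\n", "กกขนาก\n", "\n", "กก\n"]
with_comments = normalize_lines(sample)
without_comments = normalize_lines(sample, comments=False)
```

With comments=True the header line is kept as an ordinary member; with comments=False it becomes empty after the cut and is dropped.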

get_corpus_as_is

pythainlp.corpus.get_corpus_as_is(filename: str) list[source]

Read corpus data from file, as it is, and return a list.

Each line in the file will be a member of the list.

No modifications in member values and their orders.

If strip or comment removal is needed, use get_corpus() instead.

Parameters:

filename (str) – filename of the corpus to be read

Returns:

list consisting of lines in the file

Return type:

list

Example:

from pythainlp.corpus import get_corpus_as_is

# input file (negations_th.txt):
# แต่
# ไม่

get_corpus_as_is("negations_th.txt")
# output:
# ['แต่', 'ไม่']

get_corpus_db

pythainlp.corpus.get_corpus_db(url: str)[source]

Get corpus catalog from server.

Parameters:

url (str) – URL of the corpus catalog

get_corpus_db_detail

pythainlp.corpus.get_corpus_db_detail(name: str, version: str = '') dict[source]

Get details about a corpus, using information from the local catalog.

Parameters:

name (str) – name of corpus

Returns:

details about corpus

Return type:

dict

get_corpus_default_db

pythainlp.corpus.get_corpus_default_db(name: str, version: str = '') str | None[source]

Get model path from default_db.json

Parameters:

name (str) – corpus name

Returns:

path to the corpus or None if the corpus doesn’t exist on the device

Return type:

str or None

To change the default corpus database, edit pythainlp/corpus/default_db.json.

get_corpus_path

pythainlp.corpus.get_corpus_path(name: str, version: str = '', force: bool = False) str | None[source]

Get corpus path.

Parameters:
  • name (str) – corpus name

  • version (str) – version

  • force (bool) – force downloading

Returns:

path to the corpus or None if the corpus doesn’t exist on the device

Return type:

str or None

Example:

(Please see the filename in this file

If the corpus already exists:

from pythainlp.corpus import get_corpus_path

print(get_corpus_path('ttc'))
# output: /root/pythainlp-data/ttc_freq.txt

If the corpus has not been downloaded yet:

from pythainlp.corpus import download, get_corpus_path

print(get_corpus_path('wiki_lm_lstm'))
# output: None

download('wiki_lm_lstm')
# output:
# Download: wiki_lm_lstm
# wiki_lm_lstm 0.32
# thwiki_lm.pth?dl=1: 1.05GB [00:25, 41.5MB/s]
# /root/pythainlp-data/thwiki_model_lstm.pth

print(get_corpus_path('wiki_lm_lstm'))
# output: /root/pythainlp-data/thwiki_model_lstm.pth

download

pythainlp.corpus.download(name: str, force: bool = False, url: str = '', version: str = '') bool[source]

Download corpus.

The available corpus names can be seen in this file: https://pythainlp.github.io/pythainlp-corpus/db.json

Parameters:
  • name (str) – corpus name

  • force (bool) – force downloading

  • url (str) – URL of the corpus catalog

  • version (str) – version of the corpus

Returns:

True if the corpus is found and successfully downloaded. Otherwise, it returns False.

Return type:

bool

Example:

from pythainlp.corpus import download

download('wiki_lm_lstm', force=True)
# output:
# Corpus: wiki_lm_lstm
# - Downloading: wiki_lm_lstm 0.1
# thwiki_lm.pth:  26%|██▌       | 114k/434k [00:00<00:00, 690kB/s]

By default, downloaded corpora and models will be saved in $HOME/pythainlp-data/ (e.g. /Users/bact/pythainlp-data/wiki_lm_lstm.pth).
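
As a rough sketch of the default location described above, the path of a downloaded file can be built from the home directory. The helper name default_data_path is hypothetical, and the sketch assumes only the $HOME/pythainlp-data/ default mentioned here; it does not consult PyThaiNLP itself.

```python
import os


def default_data_path(filename: str) -> str:
    """Hypothetical helper: build a path under $HOME/pythainlp-data/."""
    return os.path.join(os.path.expanduser("~"), "pythainlp-data", filename)


path = default_data_path("wiki_lm_lstm.pth")
```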

remove

pythainlp.corpus.remove(name: str) bool[source]

Remove corpus

Parameters:

name (str) – corpus name

Returns:

True if the corpus is found and successfully removed. Otherwise, it returns False.

Return type:

bool

Example:

from pythainlp.corpus import remove, get_corpus_path, get_corpus

print(remove('ttc'))
# output: True

print(get_corpus_path('ttc'))
# output: None

get_corpus('ttc')
# output:
# FileNotFoundError: [Errno 2] No such file or directory:
# '/usr/local/lib/python3.6/dist-packages/pythainlp/corpus/ttc'

provinces

pythainlp.corpus.provinces(details: bool = False) FrozenSet[str] | List[dict][source]

Return a frozenset of Thailand province names in Thai such as “กระบี่”, “กรุงเทพมหานคร”, “กาญจนบุรี”, and “อุบลราชธานี”.

(See: dev/pythainlp/corpus/thailand_provinces_th.txt)

Parameters:

details (bool) – whether to return details of provinces

Returns:

frozenset containing province names of Thailand (if details is False) or list containing dict of province names and details such as [{'name_th': 'นนทบุรี', 'abbr_th': 'นบ', 'name_en': 'Nonthaburi', 'abbr_en': 'NBI'}].

Return type:

frozenset or list
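
The detailed records are plain dicts, so they can be reshaped with ordinary comprehensions. The sketch below uses only the single sample record from the description above in place of a real provinces(details=True) call.

```python
# One record in the documented shape; a real call would return many of these.
province_details = [
    {"name_th": "นนทบุรี", "abbr_th": "นบ", "name_en": "Nonthaburi", "abbr_en": "NBI"},
]

# Map Thai names to English names for quick lookup.
th_to_en = {p["name_th"]: p["name_en"] for p in province_details}

# The plain name set corresponds to what details=False is documented to return.
names_only = frozenset(p["name_th"] for p in province_details)
```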

thai_dict

pythainlp.corpus.thai_dict() dict[source]

Return Thai dictionary with definition from wiktionary.

(See: thai_dict)

Returns:

Thai words with part-of-speech type and definition

Return type:

dict

thai_stopwords

pythainlp.corpus.thai_stopwords() FrozenSet[str][source]

Return a frozenset of Thai stopwords such as “มี”, “ไป”, “ไง”, “ขณะ”, “การ”, and “ประการหนึ่ง”.

(See: dev/pythainlp/corpus/stopwords_th.txt)

The stopword list comes from the thesis by เพ็ญศิริ ลี้ตระกูล.

See Also:

เพ็ญศิริ ลี้ตระกูล . การเลือกประโยคสำคัญในการสรุปความภาษาไทยโดยใช้แบบจำลองแบบลำดับชั้น. กรุงเทพมหานคร : มหาวิทยาลัยธรรมศาสตร์; 2551.

Returns:

frozenset containing stopwords.

Return type:

frozenset
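
A typical use of a stopword frozenset is filtering tokens by membership. The sketch below uses a small stand-in set built from the examples listed above rather than calling thai_stopwords(), and the token list is a made-up, already tokenized input.

```python
# Stand-in for thai_stopwords(), using only the examples listed above.
stopwords = frozenset({"มี", "ไป", "ไง", "ขณะ", "การ", "ประการหนึ่ง"})

# Hypothetical, already tokenized input.
tokens = ["ฉัน", "ไป", "โรงเรียน"]

# Membership tests on a frozenset are constant time on average.
content_tokens = [t for t in tokens if t not in stopwords]
```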

thai_words

pythainlp.corpus.thai_words() FrozenSet[str][source]

Return a frozenset of Thai words such as “กติกา”, “กดดัน”, “พิษ”, and “พิษภัย”.

(See: dev/pythainlp/corpus/words_th.txt)

Returns:

frozenset containing words in the Thai language.

Return type:

frozenset

thai_wsd_dict

pythainlp.corpus.thai_wsd_dict() dict[source]

Return Thai Word Sense Disambiguation dictionary with definition from wiktionary.

(See: thai_dict)

Returns:

Thai words with part-of-speech type and definition

Return type:

dict

thai_orst_words

pythainlp.corpus.thai_orst_words() FrozenSet[str][source]

Return a frozenset of Thai words from the Royal Society of Thailand.

(See: dev/pythainlp/corpus/thai_orst_words.txt)

Returns:

frozenset containing words in the Thai language.

Return type:

frozenset

thai_synonyms

pythainlp.corpus.thai_synonyms() dict[source]

Return Thai synonyms.

(See: thai_synonym)

Returns:

Thai words with part-of-speech type and synonym

Return type:

dict

thai_syllables

pythainlp.corpus.thai_syllables() FrozenSet[str][source]

Return a frozenset of Thai syllables such as “กรอบ”, “ก็”, “๑”, “โมบ”, “โมน”, “โม่ง”, “กา”, “ก่า”, and “ก้า”.

(See: dev/pythainlp/corpus/syllables_th.txt)

We use the Thai syllable list from KUCut.

Returns:

frozenset containing syllables in the Thai language.

Return type:

frozenset

thai_negations

pythainlp.corpus.thai_negations() FrozenSet[str][source]

Return a frozenset of Thai negation words including “ไม่” and “แต่”.

(See: dev/pythainlp/corpus/negations_th.txt)

Returns:

frozenset containing negations in the Thai language.

Return type:

frozenset

thai_family_names

pythainlp.corpus.thai_family_names() FrozenSet[str][source]

Return a frozenset of Thai family names.

(See: dev/pythainlp/corpus/family_names_th.txt)

Returns:

frozenset containing Thai family names.

Return type:

frozenset

thai_female_names

pythainlp.corpus.thai_female_names() FrozenSet[str][source]

Return a frozenset of Thai female names.

(See: dev/pythainlp/corpus/person_names_female_th.txt)

Returns:

frozenset containing Thai female names.

Return type:

frozenset

thai_male_names

pythainlp.corpus.thai_male_names() FrozenSet[str][source]

Return a frozenset of Thai male names.

(See: dev/pythainlp/corpus/person_names_male_th.txt)

Returns:

frozenset containing Thai male names.

Return type:

frozenset

pythainlp.corpus.th_en_translit.get_transliteration_dict

pythainlp.corpus.th_en_translit.get_transliteration_dict() defaultdict[source]

Get Thai to English transliteration dictionary.

The returned dict is in defaultdict[str, defaultdict[List[str], List[Optional[bool]]]] format.

ConceptNet

ConceptNet is an open, multilingual knowledge graph used for various natural language understanding tasks. For more information, refer to the ConceptNet documentation.

pythainlp.corpus.conceptnet.edges

pythainlp.corpus.conceptnet.edges(word: str, lang: str = 'th')[source]

Get edges from ConceptNet API. ConceptNet is a public semantic network, designed to help computers understand the meanings of words that people use.

For example, “ConceptNet” is a “knowledge graph”, and a “knowledge graph” has “common sense knowledge”, which is a part of “artificial intelligence”. “ConceptNet” is also used for “natural language understanding”, which is a part of “artificial intelligence”.

“ConceptNet” –is a–> “knowledge graph” –has–> “common sense knowledge” –a part of–> “artificial intelligence”
“ConceptNet” –used for–> “natural language understanding” –a part of–> “artificial intelligence”

This illustration shows relationships (represented as Edge) between terms (represented as Node).
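
The node-and-edge picture above can be modelled with a small adjacency list. This is a self-contained sketch of the data structure only: the triples are copied from the illustration, and nothing here calls the ConceptNet API.

```python
from collections import defaultdict

# (start node, relation, end node) triples taken from the illustration above.
triples = [
    ("ConceptNet", "is a", "knowledge graph"),
    ("knowledge graph", "has", "common sense knowledge"),
    ("common sense knowledge", "a part of", "artificial intelligence"),
    ("ConceptNet", "used for", "natural language understanding"),
    ("natural language understanding", "a part of", "artificial intelligence"),
]

# Adjacency list: node -> list of (relation, neighbour).
graph = defaultdict(list)
for start, relation, end in triples:
    graph[start].append((relation, end))


def reachable(node):
    """Return every node reachable from `node` by following edges."""
    seen, stack = set(), [node]
    while stack:
        for _, neighbour in graph.get(stack.pop(), []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


related = reachable("ConceptNet")
```

Both paths from “ConceptNet” end at “artificial intelligence”, so that node appears once in the reachable set.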

Parameters:
  • word (str) – word to be sent to ConceptNet API

  • lang (str) – abbreviation of language (e.g. th for Thai, en for English, or ja for Japanese). By default, it is th (Thai).

Returns:

edges of the given word according to the ConceptNet network

Return type:

list[dict]

Example:

from pythainlp.corpus.conceptnet import edges

edges('hello', lang='en')
# output:
# [{
#   '@id': '/a/[/r/IsA/,/c/en/hello/,/c/en/greeting/]',
#   '@type': 'Edge',
#   'dataset': '/d/conceptnet/4/en',
#   'end': {'@id': '/c/en/greeting',
#   '@type': 'Node',
#   'label': 'greeting',
#   'language': 'en',
#   'term': '/c/en/greeting'},
#   'license': 'cc:by/4.0',
#   'rel': {'@id': '/r/IsA', '@type': 'Relation', 'label': 'IsA'},
#   'sources': [
#   {
#   '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/bmsacr/]',
#   '@type': 'Source',
#   'activity': '/s/activity/omcs/vote',
#   'contributor': '/s/contributor/omcs/bmsacr'
#   },
#   {
#     '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/test/]',
#     '@type': 'Source',
#     'activity': '/s/activity/omcs/vote',
#     'contributor': '/s/contributor/omcs/test'}
#   ],
#   'start': {'@id': '/c/en/hello',
#   '@type': 'Node',
#   'label': 'Hello',
#   'language': 'en',
#   'term': '/c/en/hello'},
#   'surfaceText': '[[Hello]] is a kind of [[greeting]]',
#   'weight': 3.4641016151377544
# }, ...]

edges('สวัสดี', lang='th')
# output:
# [{
#  '@id': '/a/[/r/RelatedTo/,/c/th/สวัสดี/n/,/c/en/prosperity/]',
#  '@type': 'Edge',
#  'dataset': '/d/wiktionary/en',
#  'end': {'@id': '/c/en/prosperity',
#  '@type': 'Node',
#  'label': 'prosperity',
#  'language': 'en',
#  'term': '/c/en/prosperity'},
#  'license': 'cc:by-sa/4.0',
#  'rel': {
#      '@id': '/r/RelatedTo', '@type': 'Relation',
#      'label': 'RelatedTo'},
#  'sources': [{
#  '@id': '/and/[/s/process/wikiparsec/2/,/s/resource/wiktionary/en/]',
#  '@type': 'Source',
#  'contributor': '/s/resource/wiktionary/en',
#  'process': '/s/process/wikiparsec/2'}],
#  'start': {'@id': '/c/th/สวัสดี/n',
#  '@type': 'Node',
#  'label': 'สวัสดี',
#  'language': 'th',
#  'sense_label': 'n',
#  'term': '/c/th/สวัสดี'},
#  'surfaceText': None,
#  'weight': 1.0
# }, ...]
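
Each returned edge is a nested dict, so the fields of interest are usually pulled out into a flatter form. The sketch below works on a trimmed copy of the first record shown above instead of a live API response; summarize is a hypothetical helper.

```python
# A trimmed edge record in the shape shown in the output above.
edge = {
    "start": {"label": "Hello", "language": "en"},
    "rel": {"label": "IsA"},
    "end": {"label": "greeting", "language": "en"},
    "surfaceText": "[[Hello]] is a kind of [[greeting]]",
    "weight": 3.4641016151377544,
}


def summarize(edge: dict) -> tuple:
    """Reduce an edge to (start label, relation label, end label, weight)."""
    return (
        edge["start"]["label"],
        edge["rel"]["label"],
        edge["end"]["label"],
        edge["weight"],
    )


summary = summarize(edge)
```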

TNC (Thai National Corpus)

The Thai National Corpus (TNC) is a collection of text data in the Thai language. This module provides access to word frequency data from the TNC corpus.

pythainlp.corpus.tnc.word_freqs

pythainlp.corpus.tnc.word_freqs() List[Tuple[str, int]][source]

Get word frequency from Thai National Corpus (TNC)

(See: dev/pythainlp/corpus/tnc_freq.txt)

pythainlp.corpus.tnc.unigram_word_freqs

pythainlp.corpus.tnc.unigram_word_freqs() defaultdict[source]

Get unigram word frequency from Thai National Corpus (TNC)

pythainlp.corpus.tnc.bigram_word_freqs

pythainlp.corpus.tnc.bigram_word_freqs() defaultdict[source]

Get bigram word frequency from Thai National Corpus (TNC)

pythainlp.corpus.tnc.trigram_word_freqs

pythainlp.corpus.tnc.trigram_word_freqs() defaultdict[source]

Get trigram word frequency from Thai National Corpus (TNC)

TTC (Thai Textbook Corpus)

The Thai Textbook Corpus (TTC) is a collection of Thai language text data, primarily sourced from textbooks.

pythainlp.corpus.ttc.word_freqs

pythainlp.corpus.ttc.word_freqs() List[Tuple[str, int]][source]

Get word frequency from Thai Textbook Corpus (TTC)

(See: dev/pythainlp/corpus/ttc_freq.txt)

pythainlp.corpus.ttc.unigram_word_freqs

pythainlp.corpus.ttc.unigram_word_freqs() defaultdict[source]

Get unigram word frequency from Thai Textbook Corpus (TTC)
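
A List[Tuple[str, int]] of word counts, as returned by word_freqs(), is easy to turn into relative frequencies. The sketch below uses two sample pairs taken from the ttc_freq.txt excerpt shown earlier in place of the real list, which would be far longer.

```python
# Sample (word, count) pairs standing in for the output of word_freqs().
freqs = [("ตัวบท", 10), ("โดยนัยนี้", 1)]

# Total number of counted occurrences across all words.
total = sum(count for _, count in freqs)

# Relative frequency of each word.
relative = {word: count / total for word, count in freqs}

# Word with the highest count.
most_common = max(freqs, key=lambda pair: pair[1])[0]
```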

OSCAR

OSCAR is a multilingual corpus that includes Thai text data. This module provides access to word frequency data from the OSCAR corpus.

pythainlp.corpus.oscar.word_freqs

pythainlp.corpus.oscar.word_freqs() List[Tuple[str, int]][source]

Get word frequency from OSCAR Corpus (words tokenized using ICU)

pythainlp.corpus.oscar.unigram_word_freqs

pythainlp.corpus.oscar.unigram_word_freqs() defaultdict[source]

Get unigram word frequency from OSCAR Corpus (words tokenized using ICU)

Util

Utilities for working with the corpus data.

pythainlp.corpus.util.find_badwords

pythainlp.corpus.util.find_badwords(tokenize: Callable[[str], List[str]], training_data: Iterable[Iterable[str]]) Set[str][source]

Find words that do not work well with the tokenize function for the provided training_data.

Parameters:
  • tokenize (Callable[[str], List[str]]) – a tokenize function

  • training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

words that are considered to make tokenize perform badly

Return type:

Set[str]
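
One plausible way to think about "words that do not work well" is to re-tokenize each training sentence and count how often a produced token agrees with the reference tokens. The sketch below is an illustrative heuristic under that assumption; it is not PyThaiNLP's find_badwords(), and sketch_find_badwords and chunk2 are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def sketch_find_badwords(
    tokenize: Callable[[str], List[str]],
    training_data: Iterable[Iterable[str]],
) -> Set[str]:
    """Illustrative heuristic only; not PyThaiNLP's find_badwords()."""
    right, wrong = Counter(), Counter()
    for sentence in training_data:
        gold_tokens = list(sentence)
        gold = set(gold_tokens)
        # Re-tokenize the raw text and compare against the reference tokens.
        for token in tokenize("".join(gold_tokens)):
            if token in gold:
                right[token] += 1
            else:
                wrong[token] += 1
    # Flag tokens that are produced wrongly more often than correctly.
    return {word for word, n in wrong.items() if n > right[word]}


# Toy tokenizer (an assumption for the demo): always cuts two-character chunks.
def chunk2(text: str) -> List[str]:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


bad = sketch_find_badwords(chunk2, [["ab", "cde"], ["ab", "cd"]])
```

In this toy run "cd" is wrong once and right once, so it is not flagged, while "e" is never a reference token and is flagged.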

pythainlp.corpus.util.revise_wordset

pythainlp.corpus.util.revise_wordset(tokenize: Callable[[str], List[str]], orig_words: Iterable[str], training_data: Iterable[Iterable[str]]) Set[str][source]

Revise a set of words that could improve tokenization performance of a dictionary-based tokenize function.

orig_words will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.

Parameters:
  • tokenize (Callable[[str], List[str]]) – a tokenize function, can be any function that takes a string as input and returns a List[str]

  • orig_words (Iterable[str]) – words used by the tokenize function; these serve as the base for revision

  • training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

the revised set of words: orig_words minus the words that perform poorly with training_data

Return type:

Set[str]

Example:

from pythainlp.corpus import thai_words
from pythainlp.corpus.util import revise_wordset
from pythainlp.tokenize.longest import segment
from pythainlp.util import Trie

# Start from the default word list and add custom entries.
base_words = thai_words()
more_words = {
    "ถวิล อุดล", "ทองอินทร์ ภูริพัฒน์", "เตียง ศิริขันธ์", "จำลอง ดาวเรือง"
}
base_words = base_words.union(more_words)
dict_trie = Trie(base_words)

# Any function that maps a string to a list of tokens will do.
tokenize = lambda text: segment(text, dict_trie)

# Schematic placeholder: one list of tokens per sentence.
training_data = [
    [str, str, str, ...],
    [str, str, str, str, ...],
    ...
]

revised_words = revise_wordset(tokenize, base_words, training_data)

pythainlp.corpus.util.revise_newmm_default_wordset

pythainlp.corpus.util.revise_newmm_default_wordset(training_data: Iterable[Iterable[str]]) Set[str][source]

Revise a set of words that could improve tokenization performance of pythainlp.tokenize.newmm, a dictionary-based tokenizer and the default tokenizer for PyThaiNLP.

Words from pythainlp.corpus.thai_words() will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.

Parameters:

training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

the revised set of words: the thai_words() base set minus the words that perform poorly with training_data

Return type:

Set[str]

WordNet

PyThaiNLP API includes the WordNet module, which is an exact copy of NLTK’s WordNet API for the Thai language. WordNet is a lexical database for English and other languages.

For more details on WordNet, refer to the NLTK WordNet documentation.

pythainlp.corpus.wordnet.synsets

pythainlp.corpus.wordnet.synsets(word: str, pos: str | None = None, lang: str = 'tha')[source]

This function returns the synonym set for all lemmas of the given word with an optional argument to constrain the part of speech of the word.

Parameters:
  • word (str) – word to find synsets of

  • pos (str) – constraint of the part of speech (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb)

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha

Returns:

synsets of all lemmas of the word, constrained by the argument pos

Return type:

list[Synset]

Example:
>>> from pythainlp.corpus.wordnet import synsets
>>>
>>> synsets("ทำงาน")
[Synset('function.v.01'), Synset('work.v.02'),
 Synset('work.v.01'), Synset('work.v.08')]
>>>
>>> synsets("บ้าน", lang="tha")
[Synset('duplex_house.n.01'), Synset('dwelling.n.01'),
 Synset('house.n.01'), Synset('family.n.01'), Synset('home.n.03'),
 Synset('base.n.14'), Synset('home.n.01'),
 Synset('houseful.n.01'), Synset('home.n.07')]

The part of speech can also be constrained. For example, the word “แรง” could be interpreted as force (n.) or hard (adj.).

>>> from pythainlp.corpus.wordnet import synsets
>>> # By default, allow all parts of speech
>>> synsets("แรง", lang="tha")
>>>
>>> # only Noun
>>> synsets("แรง", pos="n", lang="tha")
[Synset('force.n.03'), Synset('force.n.02')]
>>>
>>> # only Adjective
>>> synsets("แรง", pos="a", lang="tha")
[Synset('hard.s.10'), Synset('strong.s.02')]

pythainlp.corpus.wordnet.synset

pythainlp.corpus.wordnet.synset(name_synsets)[source]

This function returns the synonym set (synset) given the name of the synset (i.e. ‘dog.n.01’, ‘chase.v.01’).

Parameters:

name_synsets (str) – name of the synset

Returns:

Synset of the given name

Return type:

Synset

Example:
>>> from pythainlp.corpus.wordnet import synset
>>>
>>> difficult = synset('difficult.a.01')
>>> difficult
Synset('difficult.a.01')
>>>
>>> difficult.definition()
'not easy; requiring great physical or mental effort to accomplish
           or comprehend or endure'

pythainlp.corpus.wordnet.all_lemma_names

pythainlp.corpus.wordnet.all_lemma_names(pos: str | None = None, lang: str = 'tha')[source]

This function returns all lemma names for all synsets of the given part of speech tag and language. If part of speech tag is not specified, all synsets of all parts of speech will be used.

Parameters:
  • pos (str) – constraint of the part of speech (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb). By default, pos is None.

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha.

Returns:

lemma names for the given part of speech and language

Return type:

list[str]

Example:
>>> from pythainlp.corpus.wordnet import all_lemma_names
>>>
>>> all_lemma_names()
['อเมริโก_เวสปุชชี',
 'เมืองชีย์เอนเน',
 'การรับเลี้ยงบุตรบุญธรรม',
 'ผู้กัด',
 'ตกแต่งเรือด้วยธง',
 'จิโอวานนิ_เวอร์จินิโอ',...]
>>>
>>> len(all_lemma_names())
80508
>>>
>>> all_lemma_names(pos="a")
['ซึ่งไม่มีแอลกอฮอล์',
 'ซึ่งตรงไปตรงมา',
 'ที่เส้นศูนย์สูตร',
 'ทางจิตใจ',...]
>>>
>>> len(all_lemma_names(pos="a"))
5277

pythainlp.corpus.wordnet.all_synsets

pythainlp.corpus.wordnet.all_synsets(pos: str | None = None)[source]

This function iterates over all synsets constrained by the given part of speech tag.

Parameters:

pos (str) – part of speech tag

Returns:

iterable of synsets constrained by the given part of speech tag

Return type:

Iterable[Synset]

Example:
>>> from pythainlp.corpus.wordnet import all_synsets
>>>
>>> generator = all_synsets(pos="n")
>>> next(generator)
Synset('entity.n.01')
>>> next(generator)
Synset('physical_entity.n.01')
>>> next(generator)
Synset('abstraction.n.06')
>>>
>>> generator = all_synsets()
>>> next(generator)
Synset('able.a.01')
>>> next(generator)
Synset('unable.a.01')

pythainlp.corpus.wordnet.langs

pythainlp.corpus.wordnet.langs()[source]

This function returns a list of ISO-639 language codes.

Returns:

ISO-639 language codes

Return type:

list[str]

Example:
>>> from pythainlp.corpus.wordnet import langs
>>> langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan',
 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb',
 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob',
 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha',
 'zsm']

pythainlp.corpus.wordnet.lemmas

pythainlp.corpus.wordnet.lemmas(word: str, pos: str | None = None, lang: str = 'tha')[source]

This function returns all lemmas given the word with an optional argument to constrain the part of speech of the word.

Parameters:
  • word (str) – word to find lemmas of

  • pos (str) – constraint of the part of speech (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb)

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha.

Returns:

all lemmas of the word, constrained by the argument pos

Return type:

list[Lemma]

Example:
>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("โปรด")
[Lemma('like.v.03.โปรด'), Lemma('like.v.02.โปรด')]
>>> print(lemmas("พระเจ้า"))
[Lemma('god.n.01.พระเจ้า'), Lemma('godhead.n.01.พระเจ้า'),
 Lemma('father.n.06.พระเจ้า'), Lemma('god.n.03.พระเจ้า')]

When the part of speech tag is specified:

>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("ม้วน")
[Lemma('roll.v.18.ม้วน'), Lemma('roll.v.17.ม้วน'),
 Lemma('roll.v.08.ม้วน'),  Lemma('curl.v.01.ม้วน'),
 Lemma('roll_up.v.01.ม้วน'), Lemma('wind.v.03.ม้วน'),
 Lemma('roll.n.11.ม้วน')]
>>>
>>> # only lemmas with Noun as the part of speech
>>> lemmas("ม้วน", pos="n")
[Lemma('roll.n.11.ม้วน')]

pythainlp.corpus.wordnet.lemma

pythainlp.corpus.wordnet.lemma(name_synsets)[source]

This function returns lemma object given the name.

Note

Supports only English (eng).

Parameters:

name_synsets (str) – name of the synset

Returns:

lemma object with the given name

Return type:

Lemma

Example:
>>> from pythainlp.corpus.wordnet import lemma
>>>
>>> lemma('practice.v.01.exercise')
Lemma('practice.v.01.exercise')
>>>
>>> lemma('drill.v.03.exercise')
Lemma('drill.v.03.exercise')
>>>
>>> lemma('exercise.n.01.exercise')
Lemma('exercise.n.01.exercise')

pythainlp.corpus.wordnet.lemma_from_key

pythainlp.corpus.wordnet.lemma_from_key(key)[source]

This function returns lemma object given the lemma key. This is similar to lemma() but it needs to be given the key of lemma instead of the name of lemma.

Note

Supports only English (eng).

Parameters:

key (str) – key of the lemma object

Returns:

lemma object with the given key

Return type:

Lemma

Example:
>>> from pythainlp.corpus.wordnet import lemma, lemma_from_key
>>>
>>> practice = lemma('practice.v.01.exercise')
>>> practice.key()
'exercise%2:41:00::'
>>> lemma_from_key(practice.key())
Lemma('practice.v.01.exercise')

pythainlp.corpus.wordnet.path_similarity

pythainlp.corpus.wordnet.path_similarity(synsets1, synsets2)[source]

This function returns similarity between two synsets based on the shortest path distance calculated using the equation below.

\[path\_similarity = {1 \over shortest\_path\_distance(synsets1, synsets2) + 1}\]

The shortest path distance is computed over the is-a (hypernym/hyponym) taxonomy. The score ranges from 0 to 1; a path similarity of 1 means the two synsets are identical.

Parameters:
  • synsets1 (Synset) – first synset to measure the path similarity with

  • synsets2 (Synset) – second synset to measure the path similarity with

Returns:

path similarity between two synsets

Return type:

float

Example:
>>> from pythainlp.corpus.wordnet import path_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> path_similarity(entity, obj)
0.3333333333333333
>>> path_similarity(entity, cat)
0.07142857142857142
>>> path_similarity(obj, cat)
0.08333333333333333
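
The scores above follow directly from the formula once the shortest path distance is known. The sketch below applies the formula to distances of 2, 13 and 11; these distances are inferred from the example scores rather than queried from WordNet, and path_similarity_from_distance is a hypothetical helper.

```python
def path_similarity_from_distance(distance: int) -> float:
    """Apply 1 / (shortest_path_distance + 1)."""
    return 1 / (distance + 1)


# Distances implied by the example scores above (assumed, not queried from WordNet).
entity_obj = path_similarity_from_distance(2)
entity_cat = path_similarity_from_distance(13)
obj_cat = path_similarity_from_distance(11)
```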

pythainlp.corpus.wordnet.lch_similarity

pythainlp.corpus.wordnet.lch_similarity(synsets1, synsets2)[source]

This function returns Leacock Chodorow similarity (LCH) between two synsets, based on the shortest path distance and the maximum depth of the taxonomy. The equation to calculate LCH similarity is shown below:

\[lch\_similarity = -\log\left({shortest\_path\_distance(synsets1, synsets2) \over 2 \times taxonomy\_depth}\right)\]

Parameters:
  • synsets1 (Synset) – first synset to measure the LCH similarity with

  • synsets2 (Synset) – second synset to measure the LCH similarity with

Returns:

LCH similarity between two synsets

Return type:

float

Example:
>>> from pythainlp.corpus.wordnet import lch_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> lch_similarity(entity, obj)
2.538973871058276
>>> lch_similarity(entity, cat)
0.9985288301111273
>>> lch_similarity(obj, cat)
1.1526795099383855
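
The example values above can be reproduced numerically under two assumptions that are not stated in this document: the path length is counted in nodes (shortest path distance + 1), and the noun taxonomy depth is 19. The helper name lch_from_distance is hypothetical, and the distances 2, 13 and 11 are inferred from the path similarity examples.

```python
import math


def lch_from_distance(distance: int, taxonomy_depth: int) -> float:
    """-log(p / (2 * depth)), with p = distance + 1 (path length in nodes)."""
    return -math.log((distance + 1) / (2.0 * taxonomy_depth))


# Assumed values: noun taxonomy depth 19, distances 2, 13 and 11.
entity_obj = lch_from_distance(2, 19)
entity_cat = lch_from_distance(13, 19)
obj_cat = lch_from_distance(11, 19)
```

Because the score is a negative logarithm of a ratio below 1, shorter paths give larger values, and the score is not bounded by 1.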

pythainlp.corpus.wordnet.wup_similarity

pythainlp.corpus.wordnet.wup_similarity(synsets1, synsets2)[source]

This function returns Wu-Palmer similarity (WUP) between two synsets, based on the depth of the two senses in the taxonomy and their Least Common Subsumer (most specific ancestor node).

Parameters:
  • synsets1 (Synset) – first synset to measure the WUP similarity with

  • synsets2 (Synset) – second synset to measure the WUP similarity with

Returns:

WUP similarity between two synsets

Return type:

float

Example:
>>> from pythainlp.corpus.wordnet import wup_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> wup_similarity(entity, obj)
0.5
>>> wup_similarity(entity, cat)
0.13333333333333333
>>> wup_similarity(obj, cat)
0.35294117647058826
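
Wu-Palmer similarity is commonly written as 2 × depth(LCS) / (depth(s1) + depth(s2)). The sketch below applies that form with depths counted in nodes from the root; the depths used (entity = 1, object = 3, cat = 14) are assumptions chosen to match the example scores above, and wup_from_depths is a hypothetical helper.

```python
def wup_from_depths(depth_lcs: int, depth1: int, depth2: int) -> float:
    """2 * depth(LCS) / (depth(s1) + depth(s2)), depths counted in nodes from the root."""
    return 2 * depth_lcs / (depth1 + depth2)


# Assumed depths: entity = 1, object = 3, cat = 14.
# The least common subsumer of entity with anything is entity itself;
# for object and cat it is assumed to be object.
entity_obj = wup_from_depths(1, 1, 3)
entity_cat = wup_from_depths(1, 1, 14)
obj_cat = wup_from_depths(3, 3, 14)
```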

pythainlp.corpus.wordnet.morphy

pythainlp.corpus.wordnet.morphy(form, pos: str | None = None)[source]

This function finds a possible base form for the given form, with the given part of speech.

Parameters:
  • form (str) – the form to find the base form of

  • pos (str) – part of speech tag of words to be searched

Returns:

base form of the given form

Return type:

str

Example:
>>> from pythainlp.corpus.wordnet import morphy
>>>
>>> morphy("dogs")
'dog'
>>>
>>> morphy("thieves")
'thief'
>>>
>>> morphy("mixed")
'mix'
>>>
>>> morphy("calculated")
'calculate'

pythainlp.corpus.wordnet.custom_lemmas

pythainlp.corpus.wordnet.custom_lemmas(tab_file, lang: str)[source]

This function reads a custom tab file (see: http://compling.hss.ntu.edu.sg/omw/) containing mappings of lemmas in the given language.

Parameters:
  • tab_file – Tab file as a file or file-like object

  • lang (str) – abbreviation of language (i.e. eng, tha).

Definition

Synset

A synset is a set of synonyms that share a common meaning. The WordNet module provides functionality to work with these synsets.

This documentation is designed to help you navigate and use the various resources and modules available in the pythainlp.corpus package effectively. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support.

We hope you find this documentation helpful for your natural language processing tasks in the Thai language.