pythainlp.corpus

The pythainlp.corpus module provides access to the corpora that come with PyThaiNLP.

Modules

pythainlp.corpus.countries() → frozenset[source]

Return a frozenset of country names in Thai such as “แคนาดา”, “โรมาเนีย”, “แอลจีเรีย”, and “ลาว”.

(See: dev/pythainlp/corpus/countries_th.txt)

Returns

frozenset containing country names in Thai

Return type

frozenset

pythainlp.corpus.get_corpus(filename: str, as_is: bool = False) → Union[frozenset, list][source]

Read corpus data from file and return a frozenset or a list.

Each line in the file will be a member of the set or the list.

By default, a frozenset will be returned, with whitespace stripped, and empty values and duplicates removed.

If as_is is True, a list will be returned, with no modifications to member values or their order.

(Please see the available filenames in this file.)

Parameters

filename (str) – filename of the corpus to be read

Returns

frozenset or list consisting of lines in the file

Return type

frozenset or list

Example

from pythainlp.corpus import get_corpus

get_corpus('negations_th.txt')
# output:
# frozenset({'แต่', 'ไม่'})

get_corpus('ttc_freq.txt')
# output:
# frozenset({'โดยนัยนี้\t1',
#    'ตัวบท\t10',
#    'หยิบยื่น\t3',
#     ...})
pythainlp.corpus.get_corpus_db(url: str) → requests.models.Response[source]

Get the corpus catalog from the server.

Parameters

url (str) – URL of the corpus catalog

pythainlp.corpus.get_corpus_db_detail(name: str, version: Optional[str] = None) → dict[source]

Get details about a corpus, using information from the local catalog.

Parameters
  • name (str) – corpus name

  • version (str) – corpus version

Returns

details about a corpus

Return type

dict

pythainlp.corpus.get_corpus_path(name: str, version: Optional[str] = None) → Optional[str][source]

Get corpus path.

Parameters

name (str) – corpus name

Returns

path to the corpus, or None if the corpus does not exist on the device

Return type

str or None

Example

If the corpus already exists:

from pythainlp.corpus import get_corpus_path

print(get_corpus_path('ttc'))
# output: /root/pythainlp-data/ttc_freq.txt

If the corpus has not been downloaded yet:

from pythainlp.corpus import download, get_corpus_path

print(get_corpus_path('wiki_lm_lstm'))
# output: None

download('wiki_lm_lstm')
# output:
# Download: wiki_lm_lstm
# wiki_lm_lstm 0.32
# thwiki_lm.pth?dl=1: 1.05GB [00:25, 41.5MB/s]
# /root/pythainlp-data/thwiki_model_lstm.pth

print(get_corpus_path('wiki_lm_lstm'))
# output: /root/pythainlp-data/thwiki_model_lstm.pth
pythainlp.corpus.download(name: str, force: bool = False, url: Optional[str] = None, version: Optional[str] = None) → bool[source]

Download corpus.

The available corpus names can be seen in this file: https://github.com/PyThaiNLP/pythainlp-corpus/blob/master/db.json

Parameters
  • name (str) – corpus name

  • force (bool) – force download

  • url (str) – URL of the corpus catalog

  • version (str) – Version of the corpus

Returns

True if the corpus is found and successfully downloaded. Otherwise, it returns False.

Return type

bool

Example

from pythainlp.corpus import download

download('wiki_lm_lstm', force=True)
# output:
# Corpus: wiki_lm_lstm
# - Downloading: wiki_lm_lstm 0.1
# thwiki_lm.pth:  26%|██▌       | 114k/434k [00:00<00:00, 690kB/s]

By default, downloaded corpus and model will be saved in $HOME/pythainlp-data/ (e.g. /Users/bact/pythainlp-data/wiki_lm_lstm.pth).

pythainlp.corpus.remove(name: str) → bool[source]

Remove a corpus.

Parameters

name (str) – corpus name

Returns

True if the corpus is found and successfully removed. Otherwise, it returns False.

Return type

bool

Example

from pythainlp.corpus import remove, get_corpus_path, get_corpus

print(remove('ttc'))
# output: True

print(get_corpus_path('ttc'))
# output: None

get_corpus('ttc')
# output:
# FileNotFoundError: [Errno 2] No such file or directory:
# '/usr/local/lib/python3.6/dist-packages/pythainlp/corpus/ttc'
pythainlp.corpus.provinces(details: bool = False) → Union[frozenset, list][source]

Return a frozenset of Thailand province names in Thai such as “กระบี่”, “กรุงเทพมหานคร”, “กาญจนบุรี”, and “อุบลราชธานี”.

(See: dev/pythainlp/corpus/thailand_provinces_th.txt)

Parameters

details (bool) – return details of provinces or not

Returns

frozenset containing province names of Thailand (if details is False) or list containing dicts of province names and details such as [{'name_th': 'นนทบุรี', 'abbr_th': 'นบ', 'name_en': 'Nonthaburi', 'abbr_en': 'NBI'}].

Return type

frozenset or list

pythainlp.corpus.thai_stopwords() → frozenset[source]

Return a frozenset of Thai stopwords such as “มี”, “ไป”, “ไง”, “ขณะ”, “การ”, and “ประการหนึ่ง”.

(See: dev/pythainlp/corpus/stopwords_th.txt)

Returns

frozenset containing stopwords.

Return type

frozenset

pythainlp.corpus.thai_words() → frozenset[source]

Return a frozenset of Thai words such as “กติกา”, “กดดัน”, “พิษ”, and “พิษภัย”.

(See: dev/pythainlp/corpus/words_th.txt)

Returns

frozenset containing words in Thai language.

Return type

frozenset

pythainlp.corpus.thai_syllables() → frozenset[source]

Return a frozenset of Thai syllables such as “กรอบ”, “ก็”, “๑”, “โมบ”, “โมน”, “โม่ง”, “กา”, “ก่า”, and, “ก้า”.

(See: dev/pythainlp/corpus/syllables_th.txt)

Returns

frozenset containing syllables in Thai language.

Return type

frozenset

pythainlp.corpus.thai_negations() → frozenset[source]

Return a frozenset of Thai negation words including “ไม่” and “แต่”.

(See: dev/pythainlp/corpus/negations_th.txt)

Returns

frozenset containing negations in Thai language.

Return type

frozenset

pythainlp.corpus.thai_family_names() → frozenset[source]

Return a frozenset of Thai family names.

(See: dev/pythainlp/corpus/family_names_th.txt)

Returns

frozenset containing Thai family names.

Return type

frozenset

pythainlp.corpus.thai_female_names() → frozenset[source]

Return a frozenset of Thai female names.

(See: dev/pythainlp/corpus/person_names_female_th.txt)

Returns

frozenset containing Thai female names.

Return type

frozenset

pythainlp.corpus.thai_male_names() → frozenset[source]

Return a frozenset of Thai male names.

(See: dev/pythainlp/corpus/person_names_male_th.txt)

Returns

frozenset containing Thai male names.

Return type

frozenset

pythainlp.corpus.conceptnet.edges(word: str, lang: str = 'th')[source]

Get edges from the ConceptNet API. ConceptNet is a public semantic network, designed to help computers understand the meanings of words that people use.

For example, the term “ConceptNet” is a “knowledge graph”, and “knowledge graph” has “common sense knowledge” which is a part of “artificial intelligence”. Also, “ConceptNet” is used for “natural language understanding” which is a part of “artificial intelligence”.

“ConceptNet” –is a–> “knowledge graph” –has–> “common sense” –a part of–> “artificial intelligence”
“ConceptNet” –used for–> “natural language understanding” –a part of–> “artificial intelligence”

This illustration shows the relationships (represented as Edges) between the terms (represented as Nodes).

Parameters
  • word (str) – word to be sent to ConceptNet API

  • lang (str) – abbreviation of language (i.e. th for Thai, en for English, or ja for Japanese). By default, it is th (Thai).

Returns

edges of the given word according to the ConceptNet network.

Return type

list[dict]

Example

from pythainlp.corpus.conceptnet import edges

edges('hello', lang='en')
# output:
# [{
#   '@id': '/a/[/r/IsA/,/c/en/hello/,/c/en/greeting/]',
#   '@type': 'Edge',
#   'dataset': '/d/conceptnet/4/en',
#   'end': {'@id': '/c/en/greeting',
#   '@type': 'Node',
#   'label': 'greeting',
#   'language': 'en',
#   'term': '/c/en/greeting'},
#   'license': 'cc:by/4.0',
#   'rel': {'@id': '/r/IsA', '@type': 'Relation', 'label': 'IsA'},
#   'sources': [
#   {
#   '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/bmsacr/]',
#   '@type': 'Source',
#   'activity': '/s/activity/omcs/vote',
#   'contributor': '/s/contributor/omcs/bmsacr'
#   },
#   {
#     '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/test/]',
#     '@type': 'Source',
#     'activity': '/s/activity/omcs/vote',
#     'contributor': '/s/contributor/omcs/test'}
#   ],
#   'start': {'@id': '/c/en/hello',
#   '@type': 'Node',
#   'label': 'Hello',
#   'language': 'en',
#   'term': '/c/en/hello'},
#   'surfaceText': '[[Hello]] is a kind of [[greeting]]',
#   'weight': 3.4641016151377544
# }, ...]

edges('สวัสดี', lang='th')
# output:
# [{
#  '@id': '/a/[/r/RelatedTo/,/c/th/สวัสดี/n/,/c/en/prosperity/]',
#  '@type': 'Edge',
#  'dataset': '/d/wiktionary/en',
#  'end': {'@id': '/c/en/prosperity',
#  '@type': 'Node',
#  'label': 'prosperity',
#  'language': 'en',
#  'term': '/c/en/prosperity'},
#  'license': 'cc:by-sa/4.0',
#  'rel': {
#      '@id': '/r/RelatedTo', '@type': 'Relation',
#      'label': 'RelatedTo'},
#  'sources': [{
#  '@id': '/and/[/s/process/wikiparsec/2/,/s/resource/wiktionary/en/]',
#  '@type': 'Source',
#  'contributor': '/s/resource/wiktionary/en',
#  'process': '/s/process/wikiparsec/2'}],
#  'start': {'@id': '/c/th/สวัสดี/n',
#  '@type': 'Node',
#  'label': 'สวัสดี',
#  'language': 'th',
#  'sense_label': 'n',
#  'term': '/c/th/สวัสดี'},
#  'surfaceText': None,
#  'weight': 1.0
# }, ...]

TNC

pythainlp.corpus.tnc.word_freqs() → List[Tuple[str, int]][source]

Get word frequencies from the Thai National Corpus (TNC).

(See: dev/pythainlp/corpus/tnc_freq.txt)

TTC

pythainlp.corpus.ttc.word_freqs() → List[Tuple[str, int]][source]

Get word frequencies from the Thai Textbook Corpus (TTC).

(See: dev/pythainlp/corpus/ttc_freq.txt)

Wordnet

PyThaiNLP API is an exact copy of NLTK WordNet API. See: https://www.nltk.org/howto/wordnet.html

pythainlp.corpus.wordnet.synsets(word: str, pos: Optional[str] = None, lang: str = 'tha')[source]

This function returns the synonym sets for all lemmas of the given word, with an optional argument to constrain the part of speech of the word.

Parameters
  • word (str) – word to find its synsets

  • pos (str) – the part of speech constraint (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb)

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha

Returns

Synsets for all lemmas of the word, constrained by the argument pos.

Return type

list[Synset]

Example
>>> from pythainlp.corpus.wordnet import synsets
>>>
>>> synsets("ทำงาน")
[Synset('function.v.01'), Synset('work.v.02'),
 Synset('work.v.01'), Synset('work.v.08')]
>>>
>>> synsets("บ้าน", lang="tha")
[Synset('duplex_house.n.01'), Synset('dwelling.n.01'),
 Synset('house.n.01'), Synset('family.n.01'), Synset('home.n.03'),
 Synset('base.n.14'), Synset('home.n.01'),
 Synset('houseful.n.01'), Synset('home.n.07')]

When specifying the part of speech constraint: for example, the word “แรง” could be interpreted as force (n.) or hard (adj.).

>>> from pythainlp.corpus.wordnet import synsets
>>> # By default, accept all part of speech
>>> synsets("แรง", lang="tha")
>>>
>>> # only Noun
>>> synsets("แรง", pos="n", lang="tha")
[Synset('force.n.03'), Synset('force.n.02')]
>>>
>>> # only Adjective
>>> synsets("แรง", pos="a", lang="tha")
[Synset('hard.s.10'), Synset('strong.s.02')]
pythainlp.corpus.wordnet.synset(name_synsets)[source]

This function returns the synonym set (synset) given the name of the synset (e.g. ‘dog.n.01’, ‘chase.v.01’).

Parameters

name_synsets (str) – name of the synset

Returns

Synset of the given name

Return type

Synset

Example
>>> from pythainlp.corpus.wordnet import synset
>>>
>>> difficult = synset('difficult.a.01')
>>> difficult
Synset('difficult.a.01')
>>>
>>> difficult.definition()
'not easy; requiring great physical or mental effort to accomplish
           or comprehend or endure'
pythainlp.corpus.wordnet.all_lemma_names(pos: Optional[str] = None, lang: str = 'tha')[source]

This function returns all lemma names for all synsets for the given part of speech tag and language. If the part of speech tag is not specified, synsets for all parts of speech will be used.

Parameters
  • pos (str) – the part of speech constraint (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb). By default, pos is None.

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha.

Returns

lemma names for the given pos and language

Return type

list[str]

Example
>>> from pythainlp.corpus.wordnet import all_lemma_names
>>>
>>> all_lemma_names()
['อเมริโก_เวสปุชชี',
 'เมืองชีย์เอนเน',
 'การรับเลี้ยงบุตรบุญธรรม',
 'ผู้กัด',
 'ตกแต่งเรือด้วยธง',
 'จิโอวานนิ_เวอร์จินิโอ',...]
>>>
>>> len(all_lemma_names())
80508
>>>
>>> all_lemma_names(pos="a")
['ซึ่งไม่มีแอลกอฮอล์',
 'ซึ่งตรงไปตรงมา',
 'ที่เส้นศูนย์สูตร',
 'ทางจิตใจ',...]
>>>
>>> len(all_lemma_names(pos="a"))
5277
pythainlp.corpus.wordnet.all_synsets(pos: Optional[str] = None)[source]

This function iterates over all synsets, constrained by the given part of speech tag.

Parameters

pos (str) – part of speech tag

Returns

iterable of synsets constrained by the given part of speech tag.

Return type

Iterable[Synset]

Example
>>> from pythainlp.corpus.wordnet import all_synsets
>>>
>>> generator = all_synsets(pos="n")
>>> next(generator)
Synset('entity.n.01')
>>> next(generator)
Synset('physical_entity.n.01')
>>> next(generator)
Synset('abstraction.n.06')
>>>
>>> generator = all_synsets()
>>> next(generator)
Synset('able.a.01')
>>> next(generator)
Synset('unable.a.01')
pythainlp.corpus.wordnet.langs()[source]

This function returns a list of ISO 639 language codes.

Returns

ISO-639 language codes

Return type

list[str]

Example
>>> from pythainlp.corpus.wordnet import langs
>>> langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan',
 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb',
 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob',
 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha',
 'zsm']
pythainlp.corpus.wordnet.lemmas(word: str, pos: Optional[str] = None, lang: str = 'tha')[source]

This function returns all lemmas given the word with an optional argument to constrain the part of speech of the word.

Parameters
  • word (str) – word to find its lemmas

  • pos (str) – the part of speech constraint (i.e. n for Noun, v for Verb, a for Adjective, s for Adjective satellites, and r for Adverb)

  • lang (str) – abbreviation of language (i.e. eng, tha). By default, it is tha.

Returns

All lemmas of the word, constrained by the argument pos.

Return type

list[Lemma]

Example
>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("โปรด")
[Lemma('like.v.03.โปรด'), Lemma('like.v.02.โปรด')]
>>> print(lemmas("พระเจ้า"))
[Lemma('god.n.01.พระเจ้า'), Lemma('godhead.n.01.พระเจ้า'),
 Lemma('father.n.06.พระเจ้า'), Lemma('god.n.03.พระเจ้า')]

When specifying the part of speech tag:

>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("ม้วน")
[Lemma('roll.v.18.ม้วน'), Lemma('roll.v.17.ม้วน'),
 Lemma('roll.v.08.ม้วน'), Lemma('curl.v.01.ม้วน'),
 Lemma('roll_up.v.01.ม้วน'), Lemma('wind.v.03.ม้วน'),
 Lemma('roll.n.11.ม้วน')]
>>>
>>> # only lemmas with Noun as the part of speech
>>> lemmas("ม้วน", pos="n")
[Lemma('roll.n.11.ม้วน')]
pythainlp.corpus.wordnet.lemma(name_synsets)[source]

This function returns a lemma object given its name.

Note

Supports only the English language (eng).

Parameters

name_synsets (str) – name of the synset

Returns

lemma object with the given name

Return type

Lemma

Example
>>> from pythainlp.corpus.wordnet import lemma
>>>
>>> lemma('practice.v.01.exercise')
Lemma('practice.v.01.exercise')
>>>
>>> lemma('drill.v.03.exercise')
Lemma('drill.v.03.exercise')
>>>
>>> lemma('exercise.n.01.exercise')
Lemma('exercise.n.01.exercise')
pythainlp.corpus.wordnet.lemma_from_key(key)[source]

This function returns a lemma object given the lemma key. It is similar to lemma(), but takes the key of the lemma instead of its name.

Note

Supports only the English language (eng).

Parameters

key (str) – key of the lemma object

Returns

lemma object with the given key

Return type

Lemma

Example
>>> from pythainlp.corpus.wordnet import lemma, lemma_from_key
>>>
>>> practice = lemma('practice.v.01.exercise')
>>> practice.key()
'exercise%2:41:00::'
>>> lemma_from_key(practice.key())
Lemma('practice.v.01.exercise')
pythainlp.corpus.wordnet.path_similarity(synsets1, synsets2)[source]

This function returns the similarity between two synsets, based on the shortest path distance, computed by the following equation.

\[path\_similarity = {1 \over shortest\_path\_distance(synsets1, synsets2) + 1}\]

The shortest path distance is calculated through the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1. A path similarity of 1 indicates that the synsets are identical.

Parameters
  • synsets1 (Synset) – first synset supplied to measure the path similarity

  • synsets2 (Synset) – second synset supplied to measure the path similarity

Returns

path similarity between two synsets

Return type

float

Example
>>> from pythainlp.corpus.wordnet import path_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> path_similarity(entity, obj)
0.3333333333333333
>>> path_similarity(entity, cat)
0.07142857142857142
>>> path_similarity(obj, cat)
0.08333333333333333
pythainlp.corpus.wordnet.lch_similarity(synsets1, synsets2)[source]

This function returns Leacock Chodorow similarity (LCH) between two synsets, based on the shortest path distance and the maximum depth of the taxonomy. The equation to calculate LCH similarity is shown below:

\[lch\_similarity = -\log\left({shortest\_path\_distance(synsets1, synsets2) \over 2 \times taxonomy\_depth}\right)\]
Parameters
  • synsets1 (Synset) – first synset supplied to measure the LCH similarity

  • synsets2 (Synset) – second synset supplied to measure the LCH similarity

Returns

LCH similarity between two synsets

Return type

float

Example
>>> from pythainlp.corpus.wordnet import lch_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> lch_similarity(entity, obj)
2.538973871058276
>>> lch_similarity(entity, cat)
0.9985288301111273
>>> lch_similarity(obj, cat)
1.1526795099383855
pythainlp.corpus.wordnet.wup_similarity(synsets1, synsets2)[source]

This function returns Wu-Palmer similarity (WUP) between two synsets, based on the depth of the two senses in the taxonomy and their Least Common Subsumer (most specific ancestor node).

Parameters
  • synsets1 (Synset) – first synset supplied to measure the WUP similarity

  • synsets2 (Synset) – second synset supplied to measure the WUP similarity

Returns

WUP similarity between two synsets

Return type

float

Example
>>> from pythainlp.corpus.wordnet import wup_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> wup_similarity(entity, obj)
0.5
>>> wup_similarity(entity, cat)
0.13333333333333333
>>> wup_similarity(obj, cat)
0.35294117647058826
pythainlp.corpus.wordnet.morphy(form, pos: Optional[str] = None)[source]

This function finds a possible base form for the given form, with the given part of speech.

Parameters
  • form (str) – the form for which to find the base form

  • pos (str) – part of speech tag of words to be searched

Returns

base form of the given form

Return type

str

Example
>>> from pythainlp.corpus.wordnet import morphy
>>>
>>> morphy("dogs")
'dog'
>>>
>>> morphy("thieves")
'thief'
>>>
>>> morphy("mixed")
'mix'
>>>
>>> morphy("calculated")
'calculate'
pythainlp.corpus.wordnet.custom_lemmas(tab_file, lang: str)[source]

This function reads a custom tab file (see: http://compling.hss.ntu.edu.sg/omw/) containing mappings of lemmas in the given language.

Parameters
  • tab_file – Tab file as a file or file-like object

  • lang (str) – abbreviation of language (i.e. eng, tha).

Definition

Synset

A set of synonyms that share a common meaning.