pythainlp.tag

The pythainlp.tag module contains functions for tagging the parts of a text, such as part-of-speech tags and named entities.

Modules

pythainlp.tag.pos_tag(words: List[str], engine: str = 'perceptron', corpus: str = 'orchid') -> List[Tuple[str, str]]

Tag each word in a tokenized list with its part of speech.

Parameters
  • words (list) – a list of tokenized words

  • engine (str) – name of the tagger engine:

    • unigram - unigram tagger

    • perceptron - perceptron tagger (default)

    • artagger - RDR POS tagger

  • corpus (str) – name of the corpus whose tag set is used:

    • orchid - annotated Thai academic articles (default)

    • orchid_ud - annotated Thai academic articles using Universal Dependencies Tags

    • pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list of (word, part-of-speech tag) tuples, one per input word

pythainlp.tag.pos_tag_sents(sentences: List[List[str]], engine: str = 'perceptron', corpus: str = 'orchid') -> List[List[Tuple[str, str]]]

Part-of-speech tagging function for a list of sentences.

Parameters
  • sentences (list) – a list of lists of tokenized words

  • engine (str) – name of the tagger engine:

    • unigram - unigram tagger

    • perceptron - perceptron tagger (default)

    • artagger - RDR POS tagger

  • corpus (str) – name of the corpus whose tag set is used:

    • orchid - annotated Thai academic articles (default)

    • orchid_ud - annotated Thai academic articles using Universal Dependencies Tags

    • pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list with one inner list per sentence, each holding (word, part-of-speech tag) tuples
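
To illustrate the input and output shapes, the sketch below applies a word-level tagger to each sentence in turn. The names toy_tagger and tag_sentences, and the tags NUM and X, are hypothetical stand-ins written for this example. They are not part of PyThaiNLP, and the real engines assign tags from a trained model rather than from a fixed rule.

```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]


def toy_tagger(words: List[str]) -> Tagged:
    """Hypothetical stand-in for a word-level tagger: digits get NUM, all else X."""
    return [(word, "NUM" if word.isdigit() else "X") for word in words]


def tag_sentences(
    sentences: List[List[str]],
    tagger: Callable[[List[str]], Tagged] = toy_tagger,
) -> List[Tagged]:
    """Tag each tokenized sentence independently, preserving sentence order."""
    return [tagger(sentence) for sentence in sentences]


sentences = [["วันที่", "15"], ["เวลา", "14"]]
tagged = tag_sentences(sentences)

# One inner list per input sentence, one (word, tag) tuple per input word.
print(tagged)
```

The result has the same nesting as the input: a list of sentences becomes a list of tagged sentences.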

pythainlp.tag.tag_provinces(tokens: List[str]) -> List[Tuple[str, str]]

Recognize the names of Thai provinces in text.

The input is a list of words. The function returns a list of (word, label) tuples.

Example::
>>> from pythainlp.tag import tag_provinces
>>> text = ['หนองคาย', 'น่าอยู่']
>>> tag_provinces(text)
[('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]

class pythainlp.tag.named_entity.ThaiNameTagger

get_ner(text: str, pos: bool = True) -> Union[List[Tuple[str, str]], List[Tuple[str, str, str]]]

Get the named entities in a text.

Parameters
  • text (string) – Thai text

  • pos (boolean) – whether to include part-of-speech tags in the output (True, the default) or omit them (False)

Returns

a list of (word, named-entity label) tuples, or (word, part-of-speech tag, named-entity label) tuples when pos is True

Example::
>>> from pythainlp.tag.named_entity import ThaiNameTagger
>>> ner = ThaiNameTagger()
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")
[('วันที่', 'NOUN', 'O'), (' ', 'PUNCT', 'O'), ('15', 'NUM', 'B-DATE'),
(' ', 'PUNCT', 'I-DATE'), ('ก.ย.', 'NOUN', 'I-DATE'),
(' ', 'PUNCT', 'I-DATE'), ('61', 'NUM', 'I-DATE'),
(' ', 'PUNCT', 'O'), ('ทดสอบ', 'VERB', 'O'),
('ระบบ', 'NOUN', 'O'), ('เวลา', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('14', 'NOUN', 'B-TIME'), (':', 'PUNCT', 'I-TIME'), ('49', 'NUM', 'I-TIME'),
(' ', 'PUNCT', 'I-TIME'), ('น.', 'NOUN', 'I-TIME')]
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.", pos=False)
[('วันที่', 'O'), (' ', 'O'), ('15', 'B-DATE'), (' ', 'I-DATE'),
('ก.ย.', 'I-DATE'), (' ', 'I-DATE'), ('61', 'I-DATE'), (' ', 'O'),
('ทดสอบ', 'O'), ('ระบบ', 'O'), ('เวลา', 'O'), (' ', 'O'), ('14', 'B-TIME'),
(':', 'I-TIME'), ('49', 'I-TIME'), (' ', 'I-TIME'), ('น.', 'I-TIME')]
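
The labels in the output above follow a B-/I-/O scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity. A minimal sketch of how such output could be merged into whole entities is shown below. The helper group_entities is hypothetical and not part of PyThaiNLP; the input list is copied from the pos=False example.

```python
from typing import List, Optional, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge B-/I- labelled tokens into (entity_type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []

    for word, label in tagged:
        if label.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_type is not None:
                entities.append((current_type, "".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the open entity.
            current_words.append(word)
        else:
            # "O", or an I- label that does not continue the open entity.
            if current_type is not None:
                entities.append((current_type, "".join(current_words)))
            current_type, current_words = None, []

    # Close an entity that runs to the end of the input.
    if current_type is not None:
        entities.append((current_type, "".join(current_words)))
    return entities


# Copied from the documented output of get_ner(..., pos=False).
tagged = [
    ('วันที่', 'O'), (' ', 'O'), ('15', 'B-DATE'), (' ', 'I-DATE'),
    ('ก.ย.', 'I-DATE'), (' ', 'I-DATE'), ('61', 'I-DATE'), (' ', 'O'),
    ('ทดสอบ', 'O'), ('ระบบ', 'O'), ('เวลา', 'O'), (' ', 'O'), ('14', 'B-TIME'),
    (':', 'I-TIME'), ('49', 'I-TIME'), (' ', 'I-TIME'), ('น.', 'I-TIME'),
]

print(group_entities(tagged))
# [('DATE', '15 ก.ย. 61'), ('TIME', '14:49 น.')]
```

Tokens are joined without a separator because the tagger output already includes whitespace as separate tokens.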

Tagger Engines

perceptron

The perceptron tagger performs part-of-speech tagging with the averaged, structured perceptron algorithm.

unigram

The unigram tagger assigns a tag to each word on its own, so it does not take the order of words in the list into account.
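
The order-independence of a unigram tagger can be seen in a small sketch. The code below is a generic most-frequent-tag model written for illustration, not PyThaiNLP's implementation; train_unigram, unigram_tag, the UNK fallback, and the tiny hand-written training data (in English for readability) are all assumptions of this example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]


def train_unigram(corpus: List[Tagged]) -> Dict[str, str]:
    """Map each word to the tag it carries most often in the training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(words: List[str], model: Dict[str, str], default: str = "UNK") -> Tagged:
    """Look up each word on its own; neighbouring words are never consulted."""
    return [(word, model.get(word, default)) for word in words]


# Hypothetical, hand-written training data.
corpus = [
    [("the", "DET"), ("dogs", "NOUN"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
]
model = train_unigram(corpus)

forward = unigram_tag(["they", "run"], model)
backward = unigram_tag(["run", "they"], model)
print(forward)   # [('they', 'PRON'), ('run', 'VERB')]
print(backward)  # [('run', 'VERB'), ('they', 'PRON')]
```

Reversing the input only reverses the output: "run" receives VERB in both positions because that is its most frequent tag in the training data, regardless of context.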

References

1

Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. Building a Thai Part-Of-Speech Tagged Corpus (ORCHID). The Journal of the Acoustical Society of Japan (E), Vol. 20, No. 3, pp. 189-198, May 1999.