pythainlp.wangchanberta

WangchanBERTa base model: wangchanberta-base-att-spm-uncased [1]

This module uses WangchanBERTa for Thai named-entity tagging, part-of-speech tagging, and subword tokenization.

Speed Benchmark

Function                         Named Entity Recognition   Part of Speech
PyThaiNLP basic function         89.7 ms                    312 ms
pythainlp.wangchanberta (CPU)    9.64 s                     9.65 s
pythainlp.wangchanberta (GPU)    8.02 s                     8 s

Notebook:

Modules

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)
get_ner(text: str, tag: bool = False) → Union[List[Tuple[str, str]], str]

This function tags named entities in text, using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters
  • text (str) – text in Thai to be tagged

  • tag (bool) – if True, return the output as text marked up with HTML-like tags (default is False)

Returns

If tag is True, the text with named entities marked up with HTML-like tags. Otherwise, a list of tuples, each pairing a tokenized word (or word group) with its NER tag.

Return type

Union[list[tuple[str, str]], str]
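
Because get_ner returns IOB-tagged tokens when tag is False, a caller may want whole entities rather than individual tokens. The following is a minimal sketch of one way to merge them. It is not part of PyThaiNLP: group_iob and the sample list are hypothetical, and the tag names (PERSON, LOCATION) are assumed only for illustration. The commented lines show where real tagger output would be plugged in.

```python
from typing import List, Optional, Tuple


def group_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper: assumes tags look like "B-TYPE", "I-TYPE" or "O".
    An "I-" tag that does not continue the current entity is ignored.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    current: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current is not None:
            entities.append(("".join(words), current))

    for word, tag in tagged:
        if tag.startswith("B-"):
            flush()
            words, current = [word], tag[2:]
        elif tag.startswith("I-") and current == tag[2:]:
            words.append(word)
        else:
            flush()
            words, current = [], None
    flush()
    return entities


# Hand-written sample in the (word, tag) shape described above.
# It is NOT real model output.
sample = [
    ("สมชาย", "B-PERSON"),
    ("ไป", "O"),
    ("เชียง", "B-LOCATION"),
    ("ใหม่", "I-LOCATION"),
]
entities = group_iob(sample)

# Assumed usage with the real tagger (requires PyThaiNLP installed with
# its WangchanBERTa dependencies):
#   from pythainlp.wangchanberta import ThaiNameTagger
#   entities = group_iob(ThaiNameTagger().get_ner("..."))
```

Thai is written without spaces between words, so the sketch joins tokens with an empty string; a different separator may be preferable for other inputs.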

pythainlp.wangchanberta.pos_tag(text: str, corpus: str = 'lst20', grouped_word: bool = False) → List[Tuple[str, str]]

Marks words with part-of-speech (POS) tags.

Parameters
  • text (str) – Thai text to be tagged

  • corpus (str) – name of the tagger corpus:

    • lst20 – an LST20 tagger (default)

  • grouped_word (bool) – whether to return grouped words (default is False)

Returns

a list of tuples (word, POS tag)

Return type

list[tuple[str, str]]
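
Since pos_tag returns plain (word, POS tag) tuples, its output can be passed directly to standard-library tools. The sketch below tallies tag frequencies. count_pos and the sample sentence are hypothetical, and the tags shown are hand-written for illustration rather than real model output.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_pos(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each POS tag occurs in a list of (word, tag) tuples."""
    return Counter(tag for _word, tag in pairs)


# Hand-written sample in the (word, POS tag) shape described above.
sample = [
    ("ฉัน", "PR"),
    ("กิน", "VV"),
    ("ข้าว", "NN"),
    ("และ", "CC"),
    ("ปลา", "NN"),
]
counts = count_pos(sample)
print(counts.most_common(1))  # [('NN', 2)]

# Assumed usage with the real tagger (requires PyThaiNLP installed with
# its WangchanBERTa dependencies):
#   from pythainlp.wangchanberta import pos_tag
#   counts = count_pos(pos_tag("..."))
```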

pythainlp.wangchanberta.segment(text: str) → List[str]

Tokenizes text into subwords using the SentencePiece tokenizer from the WangchanBERTa model.

Parameters

text (str) – text to be tokenized

Returns

list of subwords

Return type

list[str]
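
One common use of a subword tokenizer is to check how many pieces a text produces, for example before feeding it to a model with a length limit. The sketch below shows that idea with the tokenizer passed in as an argument, so it runs without the model. subword_counts and char_tokenize are hypothetical names, and char_tokenize is only a stand-in that splits text into characters; it does not imitate SentencePiece.

```python
from typing import Callable, Dict, Iterable, List


def subword_counts(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, int]:
    """Map each text to the number of pieces the given tokenizer returns."""
    return {text: len(tokenize(text)) for text in texts}


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for demonstration: one piece per character."""
    return list(text)


counts = subword_counts(["ab", "cde"], char_tokenize)
print(counts)  # {'ab': 2, 'cde': 3}

# Assumed usage with the real tokenizer (requires PyThaiNLP installed
# with its WangchanBERTa dependencies):
#   from pythainlp.wangchanberta import segment
#   counts = subword_counts(["..."], segment)
```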

References

[1] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. arXiv:2101.09635 [cs] [Internet]. 2021 Jan 23 [cited 2021 Feb 27]. Available from: http://arxiv.org/abs/2101.09635