pythainlp.wangchanberta

WangchanBERTa base model: wangchanberta-base-att-spm-uncased [1]

WangchanBERTa is used here for Thai named-entity tagging, part-of-speech tagging, and subword tokenization.

To fine-tune the model, see https://github.com/vistec-AI/thai2transformers

Speed Benchmark

Function                         Named Entity Recognition   Part of Speech
PyThaiNLP basic function         89.7 ms                    312 ms
pythainlp.wangchanberta (CPU)    9.64 s                     9.65 s
pythainlp.wangchanberta (GPU)    8.02 s                     8 s

Notebook:

Modules

class pythainlp.wangchanberta.NamedEntityRecognition(model: str = 'pythainlp/thainer-corpus-v2-base-model')
__init__(model: str = 'pythainlp/thainer-corpus-v2-base-model') -> None

Tags named entities in text using the IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • model (str) – name of the model to use, built on pretrained WangchanBERTa

get_ner(text: str, pos: bool = False, tag: bool = False) -> List[Tuple[str, str]] | str

Tags named entities in text using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – if True, return the output as a string with HTML-like tags

Returns:

If tag is True, a string in which named entities are marked with HTML-like tags. Otherwise, a list of tuples, each pairing a tokenized word (or word group) with its NER tag.

Return type:

Union[list[tuple[str, str]], str]
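
The two return shapes described above can be related with a short sketch. It is not the library's implementation: iob_to_tagged_text is a hypothetical helper, the sample pairs are hand-written rather than real model output, and the <TYPE>...</TYPE> layout is an assumption about what "HTML-like" means here.

```python
from typing import List, Optional, Tuple


def iob_to_tagged_text(pairs: List[Tuple[str, str]]) -> str:
    """Wrap each IOB-tagged entity in <TYPE>...</TYPE> markers (assumed layout)."""
    out: List[str] = []
    open_type: Optional[str] = None  # entity type currently being wrapped
    for token, tag in pairs:
        if tag.startswith("I-") and open_type == tag[2:]:
            out.append(token)  # token continues the open entity
            continue
        if open_type is not None:
            out.append(f"</{open_type}>")  # close the previous entity
            open_type = None
        if tag.startswith(("B-", "I-")):
            open_type = tag[2:]  # start a new entity
            out.append(f"<{open_type}>")
        out.append(token)
    if open_type is not None:
        out.append(f"</{open_type}>")  # close an entity that ends the text
    return "".join(out)


# Hand-written sample in IOB format, not actual model output.
sample = [
    ("วันที่", "O"), (" ", "O"),
    ("15", "B-DATE"), (" ", "I-DATE"), ("ก.ย.", "I-DATE"),
    (" ", "I-DATE"), ("61", "I-DATE"),
    (" ", "O"), ("ทดสอบ", "O"),
]
print(iob_to_tagged_text(sample))
```

Tokens tagged O pass through unchanged, so the tagged string keeps the original spacing of the input.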

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)
__init__(dataset_name: str = 'thainer', grouped_entities: bool = True)

Tags named entities in text using the IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • dataset_name (str) –

    • thainer - ThaiNER dataset

  • grouped_entities (bool) – whether to group the tokens of each entity together

get_ner(text: str, pos: bool = False, tag: bool = False) -> List[Tuple[str, str]] | str

Tags named entities in text using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – if True, return the output as a string with HTML-like tags

Returns:

If tag is True, a string in which named entities are marked with HTML-like tags. Otherwise, a list of tuples, each pairing a tokenized word (or word group) with its NER tag.

Return type:

Union[list[tuple[str, str]], str]
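
As an illustration of what a "tokenized word group" might look like, the sketch below merges consecutive tokens of one entity into a single pair. It shows one plausible reading of entity grouping, not the behaviour of ThaiNameTagger itself; group_entities is a hypothetical helper and the sample pairs are hand-written.

```python
from typing import List, Tuple


def group_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tokens of one type into a single (text, type) pair."""
    groups: List[Tuple[str, str]] = []
    for token, tag in pairs:
        # Drop the B-/I- prefix to get the bare entity type; "O" stays "O".
        label = tag[2:] if tag.startswith(("B-", "I-")) else tag
        continues = (
            tag.startswith("I-") and bool(groups) and groups[-1][1] == label
        )
        if continues:
            text, _ = groups[-1]
            groups[-1] = (text + token, label)  # extend the current group
        else:
            groups.append((token, label))  # start a new group
    return groups


# Hand-written sample in IOB format, not actual model output.
sample = [
    ("นาย", "B-PERSON"), ("สมชาย", "I-PERSON"),
    (" ", "O"), ("ไป", "O"),
    ("เชียงใหม่", "B-LOCATION"),
]
print(group_entities(sample))
```

Tokens tagged O are left as separate entries, so only entity tokens are merged.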

pythainlp.wangchanberta.segment(text: str) -> List[str]

Tokenizes text into subwords using the SentencePiece tokenizer from the WangchanBERTa model.

Parameters:

text (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]
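
A minimal usage sketch for segment follows. The wrapper tokenize_subwords is hypothetical and exists only to keep the import lazy; it assumes that importing pythainlp.wangchanberta may load, and on first use download, the pretrained tokenizer.

```python
from typing import List


def tokenize_subwords(text: str) -> List[str]:
    """Return WangchanBERTa subword pieces for text (hypothetical wrapper)."""
    if not text.strip():
        # Nothing to tokenize, so skip loading the model entirely.
        return []
    # Imported lazily: assumed to load the pretrained tokenizer on import.
    from pythainlp.wangchanberta import segment

    return segment(text)
```

Calling tokenize_subwords("ทดสอบตัดคำย่อย") would return a list of subword strings; the exact pieces depend on the model's SentencePiece vocabulary.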

References