pythainlp.wangchanberta

WangchanBERTa base model: wangchanberta-base-att-spm-uncased [1]

We use WangchanBERTa for Thai named-entity tagging, part-of-speech tagging, and subword tokenization.

To fine-tune the model, see https://github.com/vistec-AI/thai2transformers

Speed Benchmark

Function	Named Entity Recognition	Part of Speech
PyThaiNLP basic function	89.7 ms	312 ms
pythainlp.wangchanberta (CPU)	9.64 s	9.65 s
pythainlp.wangchanberta (GPU)	8.02 s	8 s
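
The figures above depend on hardware and input length. As a rough sketch of how such timings could be reproduced, the helper below measures any callable that takes a string. The names time_call and dummy_tagger are hypothetical and not part of PyThaiNLP; dummy_tagger stands in for a real tagger such as a get_ner method.

```python
import time
from typing import Callable, List


def time_call(func: Callable[[str], object], text: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        timings.append(time.perf_counter() - start)
    return min(timings)


def dummy_tagger(text: str) -> list:
    # Hypothetical stand-in for a real tagger; it only splits on whitespace.
    return [(word, "O") for word in text.split()]


elapsed = time_call(dummy_tagger, "a short sample sentence")
print(f"best of 3: {elapsed * 1000:.3f} ms")
```

To time the real models, pass a bound method such as the get_ner method of a tagger object in place of dummy_tagger. The first call usually includes model loading, so it is worth running one warm-up call before measuring.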


Modules

class pythainlp.wangchanberta.NamedEntityRecognition(model: str = 'pythainlp/thainer-corpus-v2-base-model')

__init__(model: str = 'pythainlp/thainer-corpus-v2-base-model') → None

Tags named entities in text using the IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:

model (str) – name of the pretrained model, based on WangchanBERTa

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

Tags named entities in text using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:

text (str) – Thai text to be tagged
tag (bool) – if True, format the output with HTML-like tags

Returns:

a string with entities marked by HTML-like tags if the parameter tag is True; otherwise, a list of tuples, each pairing a tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
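
In IOB output, a B- tag opens an entity, I- tags continue it, and O marks tokens outside any entity. The sketch below shows one way to merge such (word, tag) pairs into whole entities. The helper group_iob and the sample data are hypothetical illustrations, not part of PyThaiNLP; ner_with_pythainlp uses only the class and method documented above and is defined but not called, because it needs the model to be downloaded.

```python
from typing import List, Optional, Tuple


def group_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tokens of one entity into (text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in tagged:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(word)  # continue the entity that is currently open
            continue
        if label is not None:
            entities.append(("".join(words), label))  # close the open entity
        if prefix in ("B", "I"):
            words, label = [word], name  # start a new entity
        else:
            words, label = [], None  # "O": outside any entity
    if label is not None:
        entities.append(("".join(words), label))
    return entities


def ner_with_pythainlp(text: str):
    # Needs pythainlp with its transformer dependencies installed.
    from pythainlp.wangchanberta import NamedEntityRecognition

    return NamedEntityRecognition().get_ner(text)


# Hypothetical sample shaped like the documented list-of-tuples return value.
sample = [
    ("วันที่", "O"), (" ", "O"),
    ("15", "B-DATE"), (" ", "I-DATE"), ("ก.ย.", "I-DATE"),
    (" ", "O"), ("14:49", "B-TIME"),
]
print(group_iob(sample))  # [('15 ก.ย.', 'DATE'), ('14:49', 'TIME')]
```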

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)

__init__(dataset_name: str = 'thainer', grouped_entities: bool = True)

Tags named entities in text using the IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:

dataset_name (str) – name of the dataset:
- thainer - ThaiNER dataset
grouped_entities (bool) – whether to group entities

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

Tags named entities in text using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:

text (str) – Thai text to be tagged
tag (bool) – if True, format the output with HTML-like tags

Returns:

a string with entities marked by HTML-like tags if the parameter tag is True; otherwise, a list of tuples, each pairing a tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
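
To illustrate what HTML-like tagged output can look like, the sketch below renders IOB pairs as text with each entity wrapped in <LABEL>...</LABEL>. The exact format produced by get_ner with tag=True is assumed here, not confirmed by this page. The helper iob_to_tagged_text and the sample data are hypothetical; tag_with_pythainlp uses only the signatures documented above and is defined but not called, because it needs the model to be downloaded.

```python
from typing import List, Optional, Tuple


def iob_to_tagged_text(tagged: List[Tuple[str, str]]) -> str:
    """Render (word, IOB tag) pairs as text with <LABEL>...</LABEL> spans."""
    parts: List[str] = []
    open_label: Optional[str] = None
    for word, tag in tagged:
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == open_label
        if not continues:
            if open_label is not None:
                parts.append(f"</{open_label}>")  # close the previous entity
                open_label = None
            if prefix in ("B", "I"):
                parts.append(f"<{name}>")  # open a new entity
                open_label = name
        parts.append(word)
    if open_label is not None:
        parts.append(f"</{open_label}>")
    return "".join(parts)


def tag_with_pythainlp(text: str):
    # Needs pythainlp with its transformer dependencies installed.
    from pythainlp.wangchanberta import ThaiNameTagger

    tagger = ThaiNameTagger(dataset_name="thainer")
    return tagger.get_ner(text, tag=True)


# Hypothetical sample shaped like the documented list-of-tuples return value.
sample = [
    ("วันที่", "O"), (" ", "O"),
    ("15", "B-DATE"), (" ", "I-DATE"), ("ก.ย.", "I-DATE"),
    (" ", "O"), ("ทดสอบ", "O"),
]
print(iob_to_tagged_text(sample))  # วันที่ <DATE>15 ก.ย.</DATE> ทดสอบ
```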

pythainlp.wangchanberta.segment(text: str) → List[str]

Subword tokenizer, using the SentencePiece model from WangchanBERTa.

Parameters: text (str) – text to be tokenized
Returns: list of subwords
Return type: list[str]
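
SentencePiece conventionally marks a position where a space stood with the character "▁" (U+2581). As a minimal sketch, the helper below joins a list of subwords back into text under that convention. Whether the list returned by segment keeps this marker is an assumption; join_subwords and the sample pieces are hypothetical. segment_with_pythainlp wraps the documented function and is defined but not called, because it needs the model to be downloaded.

```python
from typing import List

WORD_BOUNDARY = "\u2581"  # "▁", the SentencePiece marker for a preceding space


def join_subwords(pieces: List[str]) -> str:
    """Join subword pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


def segment_with_pythainlp(text: str) -> List[str]:
    # Needs pythainlp with its transformer dependencies installed.
    from pythainlp.wangchanberta import segment

    return segment(text)


# Hypothetical pieces; real output depends on the model's vocabulary.
sample_pieces = ["▁hello", "▁wor", "ld"]
print(join_subwords(sample_pieces))  # hello world
```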

References