pythainlp.wangchanberta

The pythainlp.wangchanberta module is built upon the WangchanBERTa base model, specifically the wangchanberta-base-att-spm-uncased model, as detailed in the paper by Lowphansirikul et al. [^Lowphansirikul_2021].

This base model is utilized for various natural language processing tasks in the Thai language, including named entity recognition, part-of-speech tagging, and subword tokenization.

If you intend to fine-tune the model or explore its capabilities further, please refer to the [thai2transformers repository](https://github.com/vistec-AI/thai2transformers).

Speed Benchmark

| Function | Named Entity Recognition | Part of Speech |
| --- | --- | --- |
| PyThaiNLP basic function | 89.7 ms | 312 ms |
| pythainlp.wangchanberta (CPU) | 9.64 s | 9.65 s |
| pythainlp.wangchanberta (GPU) | 8.02 s | 8 s |
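Timings like those in the table above can be taken with a small helper such as the one below. This is only a minimal sketch: `time_call` is a hypothetical helper name, and the input text and procedure behind the published numbers are not stated here, so results will not be directly comparable.

```python
import time
from statistics import mean
from typing import Callable, List


def time_call(fn: Callable[[str], object], text: str, repeat: int = 5) -> float:
    """Return the mean wall-clock time, in seconds, of fn(text) over `repeat` runs."""
    durations: List[float] = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(text)  # the call being measured
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

To time a tagger, construct it once beforehand (for example `ner = NamedEntityRecognition()`) and pass `ner.get_ner` as `fn`, so that model loading is not counted in each run.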

For a comprehensive performance benchmark, the following notebooks are available:

Modules

class pythainlp.wangchanberta.NamedEntityRecognition(model: str = 'pythainlp/thainer-corpus-v2-base-model')

The NamedEntityRecognition class is a fundamental component for identifying named entities in Thai text. It allows you to extract entities such as names, locations, and organizations from text data.

__init__(model: str = 'pythainlp/thainer-corpus-v2-base-model') → None

Initializes the tagger, which tags named entities in text in IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • model (str) – the pretrained WangchanBERTa-based model to use

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

This function tags named entities in text in IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – output HTML-like tags.

Returns:

if tag is True, the text with HTML-like tags marking the named entities; otherwise, a list of tuples pairing each tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
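Because the default output is a flat list of IOB-tagged tokens, a common next step is to merge consecutive tokens into whole entities. The following is a minimal sketch of that step, assuming tags of the form `B-TYPE`, `I-TYPE`, and `O`; `group_iob` and `tag_text` are hypothetical helper names, not part of the module.

```python
from typing import List, Optional, Tuple


def group_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    current: Optional[str] = None  # type of the entity being built, if any

    for word, tag in tagged:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current:
            words.append(word)  # continue the open entity
            continue
        if current is not None:
            entities.append(("".join(words), current))  # close the open entity
        if prefix in ("B", "I") and label:
            words, current = [word], label  # start a new entity
        else:
            words, current = [], None  # "O" or an unrecognized tag
    if current is not None:
        entities.append(("".join(words), current))
    return entities


def tag_text(text: str) -> List[Tuple[str, str]]:
    """Tag text and group the result; imports lazily because a model is loaded."""
    from pythainlp.wangchanberta import NamedEntityRecognition

    ner = NamedEntityRecognition()  # default model from the signature above
    return group_iob(ner.get_ner(text))
```

Tokens are joined without spaces because Thai text does not separate words with spaces. `group_iob` can be exercised on its own with hand-written (word, tag) pairs, without loading the model.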

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)

The ThaiNameTagger class tags named entities in Thai text. Its output supports tasks such as entity recognition, information extraction, and text classification.

__init__(dataset_name: str = 'thainer', grouped_entities: bool = True)

Initializes the tagger, which tags named entities in text in IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • dataset_name (str) – name of the dataset the model was trained on:

    • thainer - ThaiNER dataset

  • grouped_entities (bool) – whether to group the tokens of each entity together

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

This function tags named entities in text in IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – output HTML-like tags.

Returns:

if tag is True, the text with HTML-like tags marking the named entities; otherwise, a list of tuples pairing each tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
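When `tag=True` is used, the entities have to be recovered from the tagged string. The sketch below assumes the HTML-like output has the shape `<TYPE>entity text</TYPE>`, which is an assumption about the format and not something specified above; `extract_entities` and `tag_with_thainer` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Assumed shape of the tagged output: <TYPE>entity text</TYPE>
_ENTITY = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text: str) -> List[Tuple[str, str]]:
    """Return (entity_text, entity_type) pairs found in an HTML-like tagged string."""
    return [(m.group(2), m.group(1)) for m in _ENTITY.finditer(tagged_text)]


def tag_with_thainer(text: str) -> List[Tuple[str, str]]:
    """Tag text with ThaiNameTagger; imports lazily because a model is loaded."""
    from pythainlp.wangchanberta import ThaiNameTagger

    tagger = ThaiNameTagger(dataset_name="thainer", grouped_entities=True)
    return extract_entities(tagger.get_ner(text, tag=True))
```

If the actual tag format differs, only the regular expression needs to change.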

pythainlp.wangchanberta.segment(text: str) → List[str]

Tokenizes text into subwords using the SentencePiece tokenizer from the WangchanBERTa model.

Parameters:
  • text (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]

The segment function is a subword tokenization tool that breaks down text into subword units, offering a foundation for further text processing and analysis.
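One simple use of the subword list is to count how many subwords each text produces, for example to check inputs against a length budget before further processing. This is a minimal sketch; `subword_counts` is a hypothetical helper, and it accepts any tokenizer callable so that it can be tried without loading the model.

```python
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]


def _default_tokenizer() -> Tokenizer:
    """Import segment lazily, since importing the module loads the model."""
    from pythainlp.wangchanberta import segment

    return segment


def subword_counts(texts: List[str], tokenizer: Optional[Tokenizer] = None) -> List[int]:
    """Return the number of subwords produced for each text."""
    tokenize = tokenizer if tokenizer is not None else _default_tokenizer()
    return [len(tokenize(text)) for text in texts]
```

Calling `subword_counts(texts)` with no tokenizer uses `pythainlp.wangchanberta.segment`; passing a stand-in such as `list` (one "subword" per character) is enough to test the surrounding logic.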

References

[^Lowphansirikul_2021] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. [ArXiv:2101.09635](http://arxiv.org/abs/2101.09635) [Internet]. 2021 Jan 23 [cited 2021 Feb 27].