pythainlp.augment

Introduction

The pythainlp.augment module provides tools for text augmentation in Thai. Text augmentation generates alternative versions of an original text, which increases the size and variety of the data available for training and evaluating NLP models.

TextAugment Class

The central component of the pythainlp.augment module is the TextAugment class, which provides text augmentation techniques for diversifying text data.

WordNetAug Class

The WordNetAug class performs text augmentation using WordNet, a lexical database. It replaces words in a Thai sentence with synonyms found in WordNet, optionally restricted by part of speech. The following methods are available within this class:

class pythainlp.augment.WordNetAug

Text augmentation using WordNet

__init__()
find_synonyms(word: str, pos: str | None = None, postag_corpus: str = 'orchid') → List[str]

Find synonyms using WordNet

Parameters:
  • word (str) – word

  • pos (str) – part-of-speech type

  • postag_corpus (str) – name of POS tag corpus

Returns:

list of synonyms

Return type:

List[str]

augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'orchid') → List[List[str]]

Text augmentation using WordNet

Parameters:
  • sentence (str) – Thai sentence

  • tokenize (object) – function for tokenizing words

  • max_syn_sent (int) – maximum number of synonymous sentences

  • postag (bool) – use part-of-speech

  • postag_corpus (str) – name of POS tag corpus

Returns:

list of augmented sentences, each given as a sequence of tokens

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment import WordNetAug

aug = WordNetAug()
aug.augment("เราชอบไปโรงเรียน")
# output: [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไป', 'รร.'),
#  ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
#  ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
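
The output above is consistent with collecting a list of candidate synonyms for each token and enumerating their combinations until max_syn_sent sentences have been produced. The following is a minimal sketch of that idea, not the library's implementation. It assumes a hand-written synonym table in place of a WordNet lookup; TOY_SYNONYMS and combine_synonyms are hypothetical names that are not part of PyThaiNLP.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical synonym table standing in for a WordNet lookup.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "ไป": ["ไป", "ไปยัง"],
    "โรงเรียน": ["ร.ร.", "รร.", "โรงเรียน", "อาคารเรียน"],
}


def combine_synonyms(tokens: List[str], max_syn_sent: int = 6) -> List[Tuple[str, ...]]:
    """Enumerate synonym combinations, keeping at most max_syn_sent of them."""
    # A token without an entry in the table is kept unchanged.
    candidates = [TOY_SYNONYMS.get(token, [token]) for token in tokens]
    # product() varies the last position fastest; islice() caps the count.
    return list(islice(product(*candidates), max_syn_sent))


sentences = combine_synonyms(["เรา", "ชอบ", "ไป", "โรงเรียน"])
for sentence in sentences:
    print(sentence)
```

Capping the enumeration with islice keeps the result small even when several tokens have many synonyms, since the number of combinations grows multiplicatively with each token.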

Word2VecAug, Thai2fitAug, LTW2VAug Classes

The pythainlp.augment.word2vec package contains three classes for text augmentation using Word2Vec models: Word2VecAug, Thai2fitAug, and LTW2VAug. Word2VecAug takes a user-supplied model and tokenizer, while Thai2fitAug and LTW2VAug load the Thai2Fit and LTW2V embeddings respectively. Each class uses word embeddings to generate variations of a sentence.

class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')
__init__(model: str, tokenize: object, type: str = 'file') → None
Parameters:
  • model (str) – path of model

  • tokenize (object) – tokenize function

  • type (str) – model type (file, binary)

modify_sent(sent: str, p: float = 0.7) → List[List[str]]
Parameters:
  • sent (str) – text of sentence

  • p (float) – probability

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]
Parameters:
  • sentence (str) – text of sentence

  • n_sent (int) – maximum number of synonymous sentences

  • p (float) – probability

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]
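
The parameter descriptions do not say how p is applied. One plausible reading, sketched below, treats p as a minimum similarity score: a token is replaced only by neighbours whose cosine similarity to it is at least p, and at most n_sent combinations are returned. This is an assumption made for illustration, not a description of PyThaiNLP's implementation. The toy vectors and the names TOY_VECTORS, cosine, similar_words, and toy_augment are all hypothetical.

```python
import math
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical 2-dimensional embeddings standing in for a Word2Vec model.
TOY_VECTORS: Dict[str, List[float]] = {
    "ผม": [1.0, 0.0],
    "ฉัน": [0.9, 0.1],
    "เรียน": [0.0, 1.0],
    "สอน": [0.6, 0.8],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_words(word: str, p: float) -> List[str]:
    """Return neighbours with similarity >= p, or the word itself if none."""
    if word not in TOY_VECTORS:
        return [word]  # out-of-vocabulary tokens are kept unchanged
    found = [
        other
        for other, vector in TOY_VECTORS.items()
        if other != word and cosine(TOY_VECTORS[word], vector) >= p
    ]
    return found or [word]


def toy_augment(tokens: List[str], n_sent: int = 1, p: float = 0.7) -> List[Tuple[str, ...]]:
    """Combine per-token candidates into at most n_sent sentences."""
    candidates = [similar_words(token, p) for token in tokens]
    return list(islice(product(*candidates), n_sent))


result = toy_augment(["ผม", "เรียน"], n_sent=2, p=0.7)
print(result)
```

Under this reading, raising p yields fewer but closer replacements, and a token with no sufficiently similar neighbour is left as it is.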

class pythainlp.augment.word2vec.Thai2fitAug

Text augmentation using word2vec from Thai2Fit

Thai2Fit: github.com/cstorm125/thai2fit

__init__()
tokenizer(text: str) → List[str]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()

Load Thai2Fit’s word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from Thai2Fit

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of sentences

  • p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import Thai2fitAug

aug = Thai2fitAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]

class pythainlp.augment.word2vec.LTW2VAug

Text Augment using word2vec from LTW2V

LTW2V: github.com/PyThaiNLP/large-thaiword2vec

__init__()
tokenizer(text: str) → List[str]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()

Load LTW2V’s word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from LTW2V

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of sentences

  • p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import LTW2VAug

aug = LTW2VAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]

FastTextAug and Thai2transformersAug Classes

The pythainlp.augment.lm package provides two classes for model-based text augmentation: FastTextAug, which uses a fastText model, and Thai2transformersAug, which uses WangchanBERTa.

class pythainlp.augment.lm.FastTextAug(model_path: str)

Text augmentation using fastText

Parameters:

model_path (str) – path of model file

__init__(model_path: str)
Parameters:

model_path (str) – path of model file

tokenize(text: str) → List[str]

Thai text tokenization for fastText

Parameters:

text (str) – Thai text

Returns:

list of words

Return type:

List[str]

modify_sent(sent: str, p: float = 0.7) → List[List[str]]
Parameters:
  • sent (str) – text of sentence

  • p (float) – probability

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using fastText

You may want to download the Thai model from https://fasttext.cc/docs/en/crawl-vectors.html.

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of sentences

  • p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

class pythainlp.augment.lm.Thai2transformersAug
__init__()
generate(sentence: str, num_replace_tokens: int = 3)
augment(sentence: str, num_replace_tokens: int = 3) → List[str]

Text augmentation using WangchanBERTa

Parameters:
  • sentence (str) – Thai sentence

  • num_replace_tokens (int) – number of tokens to replace

Returns:

list of augmented texts

Return type:

List[str]

Example:

from pythainlp.augment.lm import Thai2transformersAug

aug = Thai2transformersAug()

aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
# output: ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
#  'ช้างมีทั้งหมด 50 ตัว บนสุด',
#  'ช้างมีทั้งหมด 50 ตัว บนบก',
#  'ช้างมีทั้งหมด 50 ตัว บนนั้น',
#  'ช้างมีทั้งหมด 50 ตัว บนหัว']

BPEmbAug Class

The pythainlp.augment.word2vec.bpemb_wv module contains the BPEmbAug class, which augments Thai text using BPEmb subword embeddings. Because it works on subword units, its output is returned as joined strings rather than tuples of word tokens, as the example for this class shows.

class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)

Thai text augmentation using word2vec from BPEmb

BPEmb: github.com/bheinzerling/bpemb

__init__(lang: str = 'th', vs: int = 100000, dim: int = 300)
tokenizer(text: str) → List[str]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()

Load BPEmb model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from BPEmb

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of sentences

  • p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

aug = BPEmbAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: ['ผมสอน', 'ผมเข้าเรียน']

Additional Functions

In addition to the classes above, the pythainlp.augment module provides the following helper function:

  • postype2wordnet: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text.
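
As an illustration of the kind of mapping involved, the sketch below converts a few ORCHID-style tags to the single-letter part-of-speech codes used by WordNet (n for noun, v for verb, a for adjective, r for adverb). The table and the name toy_postype2wordnet are hypothetical and cover only a handful of tags; the mapping actually used by postype2wordnet may differ.

```python
from typing import Dict, Optional

# Illustrative subset only: ORCHID-style tag -> WordNet POS code.
TOY_ORCHID_TO_WORDNET: Dict[str, str] = {
    "NCMN": "n",  # common noun
    "NPRP": "n",  # proper noun
    "VACT": "v",  # active verb
    "VSTA": "v",  # stative verb
    "ADVN": "r",  # adverb
}


def toy_postype2wordnet(pos: str) -> Optional[str]:
    """Return the WordNet POS code for a tag, or None if it is not mapped."""
    return TOY_ORCHID_TO_WORDNET.get(pos)


print(toy_postype2wordnet("NCMN"))
print(toy_postype2wordnet("UNKNOWN_TAG"))
```

Returning None for an unmapped tag lets the caller fall back to a synonym lookup that is not restricted by part of speech.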

Together, these classes and functions cover WordNet-based, embedding-based, and language-model-based augmentation of Thai text.

For detailed usage examples and guidelines, refer to the official PyThaiNLP documentation.