pythainlp.augment

The pythainlp.augment module provides Thai text augmentation. Use the classes below to generate augmented sentences for text augmentation tasks.

Modules

class pythainlp.augment.WordNetAug[source]

Text augmentation using WordNet

__init__()[source]
find_synonyms(word: str, pos: Optional[str] = None, postag_corpus: str = 'lst20') List[str][source]

Find synonyms from WordNet.

Parameters
  • word (str) – Thai word to find synonyms for

  • pos (str) – part-of-speech tag of the word

  • postag_corpus (str) – name of the POS tagging corpus

Returns

list of synonyms

Return type

List[str]
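
Example

A minimal usage sketch; the synonyms shown are illustrative (taken from the augment() example below) and actual results depend on the installed WordNet data:

from pythainlp.augment import WordNetAug

aug = WordNetAug()
# look up synonyms for a single word
aug.find_synonyms("โรงเรียน")
# possible output: ['ร.ร.', 'รร.', 'โรงเรียน', 'อาคารเรียน']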

augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'lst20') List[List[str]][source]

Text augmentation using WordNet.

Parameters
  • sentence (str) – Thai sentence

  • tokenize (object) – word tokenization function

  • max_syn_sent (int) – maximum number of synonym sentences to generate

  • postag (bool) – enable part-of-speech tagging

  • postag_corpus (str) – name of the POS tagging corpus

Returns

list of synonym sentences

Return type

List[Tuple[str]]

Example

from pythainlp.augment import WordNetAug

aug = WordNetAug()
aug.augment("เราชอบไปโรงเรียน")
# output: [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไป', 'รร.'),
#  ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
#  ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')[source]
__init__(model: str, tokenize: object, type: str = 'file') None[source]
Parameters
  • model (str) – path to the word2vec model file

  • tokenize (object) – tokenize function

  • type (str) – model type ('file' or 'binary')

modify_sent(sent: str, p: float = 0.7) List[List[str]][source]
Parameters
  • sent (str) – input sentence

  • p (float) – probability

Return type

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]
Parameters
  • sentence (str) – input sentence

  • n_sent (int) – maximum number of synonym sentences to generate

  • p (float) – probability

Returns

list of synonym sentences

Return type

List[Tuple[str]]
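
Example

A minimal sketch; "thai_w2v.bin" is a placeholder path for a pretrained Thai word2vec model that you supply yourself, and the output depends entirely on that model:

from pythainlp.augment.word2vec import Word2VecAug
from pythainlp.tokenize import word_tokenize

# "thai_w2v.bin" is a placeholder model path, not a bundled file
aug = Word2VecAug("thai_w2v.bin", tokenize=word_tokenize, type="binary")
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output depends on the loaded model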

class pythainlp.augment.word2vec.Thai2fitAug[source]

Text augmentation using word2vec from Thai2Fit

Thai2Fit: github.com/cstorm125/thai2fit

__init__()[source]
tokenizer(text: str) List[str][source]
Parameters

text (str) – Thai text

Return type

List[str]

load_w2v()[source]

Load the Thai2Fit word2vec model.

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from Thai2Fit.

Parameters
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability

Returns

list of augmented sentences

Return type

List[Tuple[str]]

Example

from pythainlp.augment.word2vec import Thai2fitAug

aug = Thai2fitAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]
class pythainlp.augment.word2vec.LTW2VAug[source]

Text augmentation using word2vec from LTW2V

LTW2V: github.com/PyThaiNLP/large-thaiword2vec

__init__()[source]
tokenizer(text: str) List[str][source]
Parameters

text (str) – Thai text

Return type

List[str]

load_w2v()[source]

Load the LTW2V word2vec model.

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from LTW2V.

Parameters
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability

Returns

list of augmented sentences

Return type

List[Tuple[str]]

Example

from pythainlp.augment.word2vec import LTW2VAug

aug = LTW2VAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]
class pythainlp.augment.lm.FastTextAug(model_path: str)[source]

Text augmentation using a fastText model

Parameters

model_path (str) – path to the fastText model file

__init__(model_path: str)[source]
Parameters

model_path (str) – path to the fastText model file

tokenize(text: str) List[str][source]

Tokenize Thai text for fastText.

Parameters

text (str) – Thai text

Returns

list of words

Return type

List[str]

modify_sent(sent: str, p: float = 0.7) List[List[str]][source]
Parameters
  • sent (str) – input sentence

  • p (float) – probability

Return type

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using a fastText model.

You need to download a Thai model from https://fasttext.cc/docs/en/crawl-vectors.html first.

Parameters
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability

Returns

list of augmented sentences

Return type

List[Tuple[str]]
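
Example

A minimal sketch; "cc.th.300.bin" is assumed to be the Thai model downloaded separately from the fastText site linked above, and the output depends on that model:

from pythainlp.augment.lm import FastTextAug

# path to a downloaded Thai fastText model (assumed filename)
aug = FastTextAug("cc.th.300.bin")
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output depends on the loaded model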

class pythainlp.augment.lm.Thai2transformersAug[source]
__init__()[source]
generate(sentence: str, num_replace_tokens: int = 3)[source]
augment(sentence: str, num_replace_tokens: int = 3) List[str][source]

Text augmentation using WangchanBERTa.

Parameters
  • sentence (str) – Thai sentence

  • num_replace_tokens (int) – number of tokens to replace

Returns

list of augmented sentences

Return type

List[str]

Example

from pythainlp.augment.lm import Thai2transformersAug

aug = Thai2transformersAug()

aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
# output: ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
#  'ช้างมีทั้งหมด 50 ตัว บนสุด',
#  'ช้างมีทั้งหมด 50 ตัว บนบก',
#  'ช้างมีทั้งหมด 50 ตัว บนนั้น',
#  'ช้างมีทั้งหมด 50 ตัว บนหัว']
class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]

Thai text augmentation using word2vec from BPEmb

BPEmb: github.com/bheinzerling/bpemb

__init__(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]
tokenizer(text: str) List[str][source]
Parameters

text (str) – Thai text

Return type

List[str]

load_w2v()[source]

Load the BPEmb model.

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from BPEmb.

Parameters
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability

Returns

list of augmented sentences

Return type

List[str]

Example

from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

aug = BPEmbAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: ['ผมสอน', 'ผมเข้าเรียน']