pythainlp.phayathaibert

The pythainlp.phayathaibert module is built on the PhayaThaiBERT base model.

Modules

class pythainlp.phayathaibert.ThaiTextProcessor[source]

__init__()[source]

replace_url(text: str) → str[source]

Replace URLs in text with TK_URL (https://stackoverflow.com/a/6041965).

Parameters:

text (str) – text in which to replace URLs

Returns:

text where URLs are replaced

Return type:

str

Example:

>>> replace_url("go to https://github.com")
go to <url>
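As a rough illustration of the behaviour shown in the example, the following sketch replaces URLs with a placeholder token using a deliberately simple regular expression. The name replace_url_sketch, the URL_TOKEN constant, and the pattern itself are assumptions made for illustration; the library's own pattern may match more URL forms.

```python
import re

# Assumed placeholder token, matching the "<url>" shown in the example.
URL_TOKEN = "<url>"

# Simplified pattern (an assumption): an http(s) scheme followed by
# any run of non-whitespace characters.
_URL_RE = re.compile(r"https?://\S+")


def replace_url_sketch(text: str) -> str:
    """Replace every http(s) URL in text with URL_TOKEN."""
    return _URL_RE.sub(URL_TOKEN, text)


print(replace_url_sketch("go to https://github.com"))
```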

rm_brackets(text: str) → str[source]

Remove all empty brackets and artifacts within brackets from text.

Parameters:

text (str) – text from which to remove useless brackets

Returns:

text where all useless brackets are removed

Return type:

str

Example:

>>> rm_brackets("hey() whats[;] up{*&} man(hey)")
hey whats up man(hey)
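One way to approximate this behaviour is to delete any bracket pair whose content has no word characters. The sketch below is an assumption about how such a rule could be written, not the library's implementation; rm_brackets_sketch and the patterns are hypothetical.

```python
import re

# One pattern per bracket type. Each matches a pair whose content holds
# no word characters and no nested bracket of the same type.
_USELESS_BRACKETS = [
    re.compile(r"\([^\w()]*\)"),
    re.compile(r"\[[^\w\[\]]*\]"),
    re.compile(r"\{[^\w{}]*\}"),
]


def rm_brackets_sketch(text: str) -> str:
    """Remove bracket pairs that are empty or contain only punctuation."""
    for pattern in _USELESS_BRACKETS:
        text = pattern.sub("", text)
    return text


print(rm_brackets_sketch("hey() whats[;] up{*&} man(hey)"))
```

Brackets that contain letters or digits, including Thai consonants, are left untouched because those characters count as word characters in Python's re module.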

replace_newlines(text: str) → str[source]

Replace newlines in text with spaces.

Parameters:

text (str) – text in which to replace all newlines with spaces

Returns:

text where all newlines are replaced with spaces

Return type:

str

Example:

>>> replace_newlines("hey whats\n\nup")
hey whats  up

rm_useless_spaces(text: str) → str[source]

Remove multiple spaces in text (code from fastai).

Parameters:

text (str) – text in which to reduce useless spaces

Returns:

text where all runs of spaces are reduced to one space

Return type:

str

Example:

>>> rm_useless_spaces("oh         no")
oh no

replace_spaces(text: str, space_token: str = '<_>') → str[source]

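Replace spaces with space_token ('<_>' by default).

Parameters:

text (str) – text in which to replace spaces
space_token (str) – token substituted for each space

Returns:

text where all spaces are replaced with space_token

Return type:

str

Example:

>>> replace_spaces("oh no", space_token="_")
oh_no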

replace_rep_after(text: str) → str[source]

Collapse character-level repetitions in text.

Parameters:

text (str) – input text in which to collapse repeated characters

Returns:

text with repeated characters collapsed

Return type:

str

Example:

>>> text = "กาาาาาาา"
>>> replace_rep_after(text)
'กา'
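A back-reference in a regular expression is one common way to collapse runs of a repeated character. The sketch below assumes a threshold of three or more extra copies before a run is collapsed; the function name collapse_repeats and the min_extra parameter are hypothetical and not part of the documented API.

```python
import re


def collapse_repeats(text: str, min_extra: int = 3) -> str:
    """Collapse a character followed by min_extra or more copies of itself.

    The threshold is an assumption chosen for illustration.
    """
    # (\S) captures one non-space character; \1{n,} matches n or more
    # further copies of that same character.
    pattern = re.compile(r"(\S)\1{%d,}" % min_extra)
    return pattern.sub(r"\1", text)


print(collapse_repeats("กาาาาาาา"))
```

With this threshold, ordinary doubled letters such as the "oo" in "good" are left alone, while long runs are reduced to a single character.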

replace_wrep_post(toks: List[str]) → List[str][source]

Replace repeated words after tokenization; fastai's replace_wrep does not work well with Thai.

Parameters:

toks (List[str]) – list of tokens

Returns:

list of tokens where repeated words are removed

Return type:

List[str]

Example:

>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'น้ำ']
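The effect shown in the example, keeping one token from each run of identical adjacent tokens, can be sketched with itertools.groupby. This is an illustrative assumption about the behaviour, not the library's code; dedupe_consecutive is a hypothetical name.

```python
from itertools import groupby
from typing import List


def dedupe_consecutive(toks: List[str]) -> List[str]:
    """Keep one token from each run of identical adjacent tokens."""
    # groupby yields one (key, group) pair per run of equal items.
    return [tok for tok, _run in groupby(toks)]


print(dedupe_consecutive(["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]))
```

Only adjacent duplicates are merged, so a word that recurs later in the sentence is kept.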

remove_space(toks: List[str]) → List[str][source]

Remove space tokens, which should not be included for bag-of-words models.

Parameters:

toks (List[str]) – list of tokens

Returns:

list of tokens where space tokens (" ") are filtered out

Return type:

List[str]

Example:

>>> toks = ["ฉัน", "เดิน", " ", "กลับ", "บ้าน"]
>>> remove_space(toks)
['ฉัน', 'เดิน', 'กลับ', 'บ้าน']

preprocess(text: str, pre_rules: List[Callable] = [ThaiTextProcessor.rm_brackets, ThaiTextProcessor.replace_newlines, ThaiTextProcessor.rm_useless_spaces, ThaiTextProcessor.replace_spaces, ThaiTextProcessor.replace_rep_after], tok_func: Callable = word_tokenize) → str[source]
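The page gives no description for preprocess, but its signature suggests that the text is passed through each callable in pre_rules in order before tokenization. The sketch below illustrates that rule-chaining pattern only, under that assumption; apply_rules and the three simplified rules are hypothetical stand-ins, not the library's functions.

```python
import re
from typing import Callable, List


def strip_newlines(text: str) -> str:
    """Replace each newline with a space."""
    return text.replace("\n", " ")


def squeeze_spaces(text: str) -> str:
    """Reduce runs of two or more spaces to one space."""
    return re.sub(r" {2,}", " ", text)


def mark_spaces(text: str, space_token: str = "<_>") -> str:
    """Replace each space with an explicit token."""
    return text.replace(" ", space_token)


def apply_rules(text: str, rules: List[Callable[[str], str]]) -> str:
    """Apply each rule to the output of the previous one."""
    for rule in rules:
        text = rule(text)
    return text


result = apply_rules("oh   no\nagain", [strip_newlines, squeeze_spaces, mark_spaces])
print(result)  # oh<_>no<_>again
```

Order matters in such a chain: marking spaces before squeezing them would turn every extra space into its own token.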

class pythainlp.phayathaibert.ThaiTextAugmenter[source]

__init__() → None[source]

generate(sample_text: str, word_rank: int, max_length: int = 3, sample: bool = False) → str[source]

augment(text: str, num_augs: int = 3, sample: bool = False) → List[str][source]

Text augmentation from PhayaThaiBERT

Parameters:

text (str) – Thai text
num_augs (int) – number of augmented texts to return
sample (bool) – whether to sample when generating the output; set to True if more word diversity is needed

Returns:

list of augmented texts

Return type:

List[str]

Example:

from pythainlp.phayathaibert import ThaiTextAugmenter

aug = ThaiTextAugmenter()
aug.augment("ช้างมีทั้งหมด 50 ตัว บน", num_augs=5)

# output = ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้ครับ.',
#     'ช้างมีทั้งหมด 50 ตัว บนพื้นดินครับ...',
#     'ช้างมีทั้งหมด 50 ตัว บนท้องฟ้าครับ...',
#     'ช้างมีทั้งหมด 50 ตัว บนดวงจันทร์.‼',
#     'ช้างมีทั้งหมด 50 ตัว บนเขาค่ะ😁']

class pythainlp.phayathaibert.PartOfSpeechTagger(model: str = 'lunarlist/pos_thai_phayathai')[source]

__init__(model: str = 'lunarlist/pos_thai_phayathai') → None[source]

get_tag(sentence: str, strategy: str = 'simple') → List[List[Tuple[str, str]]][source]

Marks sentences with part-of-speech (POS) tags.

Parameters:

sentence (str) – Thai text to be tagged

Returns:

a list of lists of tuples (word, POS tag)

Return type:

list[list[tuple[str, str]]]

Example:

Labels POS for given sentence:

from pythainlp.phayathaibert.core import PartOfSpeechTagger

tagger = PartOfSpeechTagger()
tagger.get_tag("แมวทำอะไรตอนห้าโมงเช้า")
# output:
# [[('แมว', 'NOUN'), ('ทําอะไร', 'VERB'), ('ตอนห้าโมงเช้า', 'NOUN')]]
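Because get_tag returns a list of sentences, each a list of (word, tag) tuples, downstream code usually needs to flatten or filter that nesting. The sketch below works on the output copied from the example and does not call the model; words_with_tag is a hypothetical helper.

```python
from typing import List, Tuple

Tagged = List[List[Tuple[str, str]]]

# Output copied from the example, used here as sample data.
tagged: Tagged = [
    [("แมว", "NOUN"), ("ทําอะไร", "VERB"), ("ตอนห้าโมงเช้า", "NOUN")]
]


def words_with_tag(sentences: Tagged, wanted: str) -> List[str]:
    """Collect the words whose POS tag equals wanted, across all sentences."""
    return [
        word
        for sentence in sentences
        for word, tag in sentence
        if tag == wanted
    ]


nouns = words_with_tag(tagged, "NOUN")
print(nouns)
```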

class pythainlp.phayathaibert.NamedEntityTagger(model: str = 'Pavarissy/phayathaibert-thainer')[source]

__init__(model: str = 'Pavarissy/phayathaibert-thainer') → None[source]

get_ner(text: str, tag: bool = False, pos: bool = False, strategy: str = 'simple') → List[Tuple[str, str]] | List[Tuple[str, str, str]] | str[source]

This function tags named entities in text in IOB format.

Parameters:

text (str) – Thai text to be tagged
tag (bool) – output HTML-like tags
pos (bool) – output part-of-speech tags (PhayaThaiBERT POS tagging is provided by PartOfSpeechTagger)

Returns:

a list of tuples of tokenized words and NER tags, with POS tags added to each tuple if pos is True; if tag is True, a string marked up with HTML-like tags instead

Return type:

Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]

Example:

>>> from pythainlp.phayathaibert.core import NamedEntityTagger
>>>
>>> tagger = NamedEntityTagger()
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย")
[('นายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย', 'PERSON'),
('จาก', 'LOCATION'),
('ประเทศไทย', 'LOCATION')]
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย", tag=True)
'ทดสอบ<PERSON>นายปวริศ เรืองจุติโพธิ์พาน</PERSON><LOCATION>จาก</LOCATION><LOCATION>ประเทศไทย</LOCATION>'
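When get_ner returns a list of (word, label) tuples, grouping the words by label is a common next step. The sketch below operates on sample tuples taken from the example rather than calling the model; group_by_label is a hypothetical helper and assumes two-element tuples (pos=False, tag=False).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_label(entities: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each entity label to the words tagged with it, in input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for word, label in entities:
        grouped[label].append(word)
    return dict(grouped)


# Sample tuples taken from the example output.
sample = [("จาก", "LOCATION"), ("ประเทศไทย", "LOCATION")]
print(group_by_label(sample))
```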

pythainlp.phayathaibert.segment(sentence: str) → List[str][source]

Subword tokenizer of PhayaThaiBERT: the SentencePiece tokenizer from the WangchanBERTa model with an expanded vocabulary.

Parameters:

sentence (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]