pythainlp.phayathaibert

The pythainlp.phayathaibert module is built upon the PhayaThaiBERT base model.

Modules

class pythainlp.phayathaibert.ThaiTextProcessor[source]
__init__()[source]
replace_url(text: str) str[source]

Replace each URL in text with the URL token <url> (regex from https://stackoverflow.com/a/6041965).

Parameters:

text (str) – text in which to replace URLs

Returns:

text with all URLs replaced

Return type:

str

Example:

>>> replace_url("go to https://github.com")
go to <url>
rm_brackets(text: str) str[source]

Remove all empty brackets and artifacts within brackets from text.

Parameters:

text (str) – text from which to remove useless brackets

Returns:

text with all useless brackets removed

Return type:

str

Example:

>>> rm_brackets("hey() whats[;] up{*&} man(hey)")
hey whats up man(hey)
replace_newlines(text: str) str[source]

Replace newlines in text with spaces.

Parameters:

text (str) – text in which to replace newlines with spaces

Returns:

text with all newlines replaced by spaces

Return type:

str

Example:

>>> replace_newlines("hey whats\nup")
hey whats up

rm_useless_spaces(text: str) str[source]

Reduce runs of multiple spaces in text to a single space (code from fastai).

Parameters:

text (str) – text in which to remove useless spaces

Returns:

text with every run of spaces reduced to one

Return type:

str

Example:

>>> rm_useless_spaces("oh         no")
oh no
replace_spaces(text: str, space_token: str = '<_>') str[source]

Replace spaces in text with space_token.

Parameters:
  • text (str) – text in which to replace spaces

  • space_token (str) – token to substitute for each space

Returns:

text with all spaces replaced by space_token

Return type:

str

Example:

>>> replace_spaces("oh no", space_token="_")
oh_no
replace_rep_after(text: str) str[source]

Collapse character-level repetitions in text.

Parameters:

text (str) – input text in which to collapse repeated characters

Returns:

text with repetitive characters removed

Return type:

str

Example:

>>> text = "กาาาาาาา"
>>> replace_rep_after(text)
'กา'
replace_wrep_post(toks: List[str]) List[str][source]

Replace repetitive words after tokenization; the fastai replace_wrep does not work well with Thai.

Parameters:

toks (List[str]) – list of tokens

Returns:

list of tokens with repetitive words removed

Return type:

List[str]

Example:

>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'น้ำ']
remove_space(toks: List[str]) List[str][source]

Remove space tokens, e.g. for bag-of-words models.

Parameters:

toks (List[str]) – list of tokens

Returns:

list of tokens with space tokens (" ") filtered out

Return type:

List[str]

Example:

>>> toks = ["ฉัน", "เดิน", " ", "กลับ", "บ้าน"]
>>> remove_space(toks)
['ฉัน', 'เดิน', 'กลับ', 'บ้าน']
preprocess(text: str, pre_rules: List[Callable] = [rm_brackets, replace_newlines, rm_useless_spaces, replace_spaces, replace_rep_after], tok_func: Callable = word_tokenize) str[source]
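preprocess carries no docstring here; from its signature it appears to apply each function in pre_rules to the text in order and then tokenize the result with tok_func. Below is a minimal sketch of that pipeline shape using simple stdlib regex stand-ins for the rules; these stand-ins are hypothetical simplifications, not the actual ThaiTextProcessor implementations, and the real method's return handling may differ:

```python
import re
from typing import Callable, List

# Hypothetical stand-ins for pre_rules; the real rules in
# ThaiTextProcessor are more elaborate.
def replace_newlines(text: str) -> str:
    # Replace runs of newlines with a single space.
    return re.sub(r"[\r\n]+", " ", text)

def rm_useless_spaces(text: str) -> str:
    # Collapse runs of spaces to a single space.
    return re.sub(" {2,}", " ", text)

def preprocess(text: str,
               pre_rules: List[Callable[[str], str]] = [replace_newlines,
                                                        rm_useless_spaces],
               tok_func: Callable[[str], List[str]] = str.split) -> str:
    # Apply each rule in order, then tokenize and rejoin.
    for rule in pre_rules:
        text = rule(text)
    return " ".join(tok_func(text))

print(preprocess("hey\nwhats   up"))  # hey whats up
```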
class pythainlp.phayathaibert.ThaiTextAugmenter[source]
__init__() None[source]
generate(sample_text: str, word_rank: int, max_length: int = 3, sample: bool = False) str[source]
augment(text: str, num_augs: int = 3, sample: bool = False) List[str][source]

Text augmentation from PhayaThaiBERT

Parameters:
  • text (str) – Thai text

  • num_augs (int) – number of augmented texts to generate

  • sample (bool) – whether to sample during generation; set to True when more word diversity is needed

Returns:

list of augmented texts

Return type:

List[str]

Example:

from pythainlp.augment.lm import ThaiTextAugmenter

aug = ThaiTextAugmenter()
aug.augment("ช้างมีทั้งหมด 50 ตัว บน", num_augs=5)

# output = ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้ครับ.',
    'ช้างมีทั้งหมด 50 ตัว บนพื้นดินครับ...',
    'ช้างมีทั้งหมด 50 ตัว บนท้องฟ้าครับ...',
    'ช้างมีทั้งหมด 50 ตัว บนดวงจันทร์.‼',
    'ช้างมีทั้งหมด 50 ตัว บนเขาค่ะ😁']
class pythainlp.phayathaibert.PartOfSpeechTagger(model: str = 'lunarlist/pos_thai_phayathai')[source]
__init__(model: str = 'lunarlist/pos_thai_phayathai') None[source]
get_tag(sentence: str, strategy: str = 'simple') List[List[Tuple[str, str]]][source]

Marks sentences with part-of-speech (POS) tags.

Parameters:

sentence (str) – a sentence in Thai to be tagged

Returns:

a list of lists of tuples (word, POS tag)

Return type:

list[list[tuple[str, str]]]

Example:

Labels POS for given sentence:

from pythainlp.phayathaibert.core import PartOfSpeechTagger

tagger = PartOfSpeechTagger()
tagger.get_tag("แมวทำอะไรตอนห้าโมงเช้า")
# output:
# [[('แมว', 'NOUN'), ('ทําอะไร', 'VERB'), ('ตอนห้าโมงเช้า', 'NOUN')]]
class pythainlp.phayathaibert.NamedEntityTagger(model: str = 'Pavarissy/phayathaibert-thainer')[source]
__init__(model: str = 'Pavarissy/phayathaibert-thainer') None[source]
get_ner(text: str, tag: bool = False, pos: bool = False, strategy: str = 'simple') List[Tuple[str, str]] | List[Tuple[str, str, str]] | str[source]

This function tags named entities in text in IOB format.

Parameters:
  • text (str) – text in Thai to be tagged

  • pos (bool) – include part-of-speech tags in the output (PhayaThaiBERT POS tagging is supported via PartOfSpeechTagger)

  • tag (bool) – return the output as an HTML-like tagged string instead of a list of tuples

Returns:

a list of (word, NER tag) tuples; a list of (word, NER tag, POS tag) tuples if pos is True; or an HTML-like tagged string if tag is True

Return type:

Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]

Example:
>>> from pythainlp.phayathaibert.core import NamedEntityTagger
>>>
>>> tagger = NamedEntityTagger()
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย")
[('นายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย', 'PERSON'),
('จาก', 'LOCATION'),
('ประเทศไทย', 'LOCATION')]
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย", tag=True)
'ทดสอบ<PERSON>นายปวริศ เรืองจุติโพธิ์พาน</PERSON>                <LOCATION>จาก</LOCATION><LOCATION>ประเทศไทย</LOCATION>'
pythainlp.phayathaibert.segment(sentence: str) List[str][source]

Subword tokenization for PhayaThaiBERT, using the SentencePiece tokenizer from the WangchanBERTa model with an expanded vocabulary.

Parameters:

sentence (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]
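segment has no example above. To illustrate the idea behind subword tokenization, here is a toy greedy longest-match tokenizer over a tiny hand-made vocabulary; this is purely illustrative and is not how the actual PhayaThaiBERT SentencePiece model works (a real model learns tens of thousands of pieces from data and uses a different matching algorithm):

```python
from typing import List

# Toy vocabulary of subword pieces; hypothetical, for illustration only.
VOCAB = {"play", "ing", "un", "break", "able"}

def toy_segment(sentence: str) -> List[str]:
    # Greedy longest-match: at each position, take the longest vocabulary
    # piece that matches, falling back to a single character.
    pieces, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in VOCAB:
                pieces.append(sentence[i:j])
                i = j
                break
        else:
            pieces.append(sentence[i])  # unknown character as its own piece
            i += 1
    return pieces

print(toy_segment("unbreakable"))  # ['un', 'break', 'able']
print(toy_segment("playing"))     # ['play', 'ing']
```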