pythainlp.ulmfit

Universal Language Model Fine-tuning for Text Classification (ULMFiT).

Modules

class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]

Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer. (see: https://docs.fast.ai/text.transform#BaseTokenizer)

__init__(lang: str = 'th')[source]
static tokenizer(text: str) List[str][source]

This function tokenizes text with a frozen newmm engine, using the dictionary specific to ULMFiT-related functions (see: Dictionary file (.txt)).

Parameters:

text (str) – text to tokenize

Returns:

tokenized text

Return type:

list[str]

Example:

Using pythainlp.ulmfit.ThaiTokenizer.tokenizer() is similar to pythainlp.tokenize.word_tokenize() with the ulmfit engine.

>>> from pythainlp.ulmfit import ThaiTokenizer
>>> from pythainlp.tokenize import word_tokenize
>>>
>>> text = "อาภรณ์, จินตมยปัญญา ภาวนามยปัญญา"
>>> ThaiTokenizer.tokenizer(text)
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา',
 ' ', 'ภาวนามยปัญญา']
>>>
>>> word_tokenize(text, engine='ulmfit')
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา',
 ' ', 'ภาวนามยปัญญา']
add_special_cases(toks)[source]
pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]

This function vectorizes Thai input text into a 400-dimension vector, using a fastai language model and data bunch.


Parameters:
  • text (str) – text to be vectorized with fastai language model.

  • learn – fastai language model learner

  • data – fastai data bunch

  • agg (str) – name of the aggregation method for word embeddings. The available methods are “mean” and “sum”.

Returns:

numpy.array of the document vector, sized 400 based on the encoder of the model

Return type:

numpy.ndarray((1, 400))

Example:
>>> from pythainlp.ulmfit import document_vector
>>> from fastai import *
>>> from fastai.text import *
>>>
>>> # Load Data Bunch
>>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
>>>
>>> # Initialize language_model_learner
>>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
     qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
     hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
>>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
>>> learn = language_model_learner(data, AWD_LSTM, config=config,
                                   pretrained=False, **trn_args)
>>> document_vector('วันนี้วันดีปีใหม่', learn, data)
See Also:
  • A notebook showing how to train a ULMFiT language model and its usage: Jupyter Notebook

pythainlp.ulmfit.fix_html(text: str) str[source]

Apply a list of replacements to HTML strings in text. (code from fastai)

Parameters:

text (str) – text in which to replace HTML strings

Returns:

text with HTML strings replaced

Return type:

str

Example:
>>> from pythainlp.ulmfit import fix_html
>>> fix_html("Anbsp;amp;nbsp;B @.@ ")
A & B.
pythainlp.ulmfit.lowercase_all(toks: Collection[str]) List[str][source]

Lowercase all English words; English words in Thai texts don’t usually have nuances of capitalization.
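
Example (a minimal sketch, assuming str.lower() semantics: English tokens are lowercased and Thai characters, which have no case, pass through unchanged):

>>> from pythainlp.ulmfit import lowercase_all
>>> lowercase_all(['Thai', 'NLP', 'ภาษาไทย'])
['thai', 'nlp', 'ภาษาไทย']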

pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]

This function inserts new vocab into an existing set of model weights (wgts); weights for vocab items not found in the pretrained vocab are initialized with the average embedding.


Parameters:
  • em_sz (int) – embedding size

  • wgts – torch model weights

  • itos_pre (list) – pretrained list of vocab

  • itos_new (list) – list of new vocab

Returns:

merged torch model weights

Example:

from pythainlp.ulmfit import merge_wgts
import torch

wgts = {'0.encoder.weight': torch.randn(5,3)}
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "เต่า", "นก"]
em_sz = 3

merge_wgts(em_sz, wgts, itos_pre, itos_new)
# output:
# {'0.encoder.weight': tensor([[0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011]]),
# '0.encoder_dp.emb.weight': tensor([[0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011]]),
# '1.decoder.weight': tensor([[0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011],
# [0.5952, 0.4453, 0.0011]])}
pythainlp.ulmfit.process_thai(text: str, pre_rules: Collection = [fix_html, reorder_vowels, spec_add_spaces, rm_useless_spaces, rm_useless_newlines, rm_brackets, replace_url, replace_rep_nonum], tok_func: Callable = Tokenizer.word_tokenize, post_rules: Collection = [ungroup_emoji, lowercase_all, replace_wrep_post_nonum, remove_space]) Collection[str][source]

Process Thai texts for models (with sparse features as default)

Parameters:
  • text (str) – text to be cleaned

  • pre_rules (list[func]) – rules to apply before tokenization.

  • tok_func (func) – tokenization function (by default, tok_func is pythainlp.tokenize.word_tokenize())

  • post_rules (list[func]) – rules to apply after tokenization.

Returns:

a list of cleaned tokenized texts

Return type:

list[str]

Example:
  1. Use default pre-rules and post-rules:

>>> from pythainlp.ulmfit import process_thai
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp;     "
>>> process_thai(text)
['บ้าน', 'xxrep', '   ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣',
'😃', '😄', '😅', 'pythainlp', '&']
  2. Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:

>>> from pythainlp.ulmfit import (
    process_thai,
    replace_rep_after,
    fix_html,
    ungroup_emoji,
    replace_wrep_post,
    remove_space)
>>>
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp;     "
>>> process_thai(text,
                 pre_rules=[replace_rep_after, fix_html],
                 post_rules=[ungroup_emoji,
                             replace_wrep_post,
                             remove_space]
                )
['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣',
 '😃', '😄', '😅', 'PyThaiNLP', '&']
pythainlp.ulmfit.rm_brackets(text: str) str[source]

Remove all empty brackets and artifacts within brackets from text.
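
Example (a minimal sketch; the expected output assumes empty bracket pairs such as () and [] are stripped, per the description above):

>>> from pythainlp.ulmfit import rm_brackets
>>> rm_brackets("วันนี้() เป็น[]วันดี")
'วันนี้ เป็นวันดี'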

pythainlp.ulmfit.rm_useless_newlines(text: str) str[source]

Remove multiple newlines in text.
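
Example (a minimal sketch; the exact replacement for a run of newlines, shown here as a single space, is an assumption and may differ by version):

>>> from pythainlp.ulmfit import rm_useless_newlines
>>> rm_useless_newlines("ไปไหน\n\n\nมา")
'ไปไหน มา'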

pythainlp.ulmfit.rm_useless_spaces(text: str) str[source]

Remove multiple spaces in text. (code from fastai)
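
Example (a minimal sketch; fastai's rule collapses runs of two or more spaces into a single space):

>>> from pythainlp.ulmfit import rm_useless_spaces
>>> rm_useless_spaces("ไป    ไหน   มา")
'ไป ไหน มา'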

pythainlp.ulmfit.remove_space(toks: Collection[str]) List[str][source]

Remove space tokens, which are not useful for bag-of-words models.

Parameters:

toks (list[str]) – list of tokens

Returns:

list of tokens with space tokens (" ") filtered out

Return type:

list[str]
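
Example (a minimal sketch; the expected output follows the description above, with space tokens dropped and the remaining tokens kept in order):

>>> from pythainlp.ulmfit import remove_space
>>> remove_space(['ผม', ' ', 'ชอบ', ' ', 'กิน'])
['ผม', 'ชอบ', 'กิน']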

pythainlp.ulmfit.replace_rep_after(text: str) str[source]

Replace repetitions at the character level in text, placing the marker after the repetition. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep 8 ย’; instead the word is retained as ‘น้อย xxrep 8’.

Parameters:

text (str) – input text to replace character repetition

Returns:

text with repetitive token xxrep and the counter after character repetition

Return type:

str

Example:
>>> from pythainlp.ulmfit import replace_rep_after
>>>
>>> text = "กาาาาาาา"
>>> replace_rep_after(text)
'กาxxrep7 '
pythainlp.ulmfit.replace_rep_nonum(text: str) str[source]

Replace repetitions at the character level in text, placing the marker after the repetition. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep ย’; instead the word is retained as ‘น้อย xxrep’.

Parameters:

text (str) – input text to replace character repetition

Returns:

text with repetitive token xxrep after character repetition

Return type:

str

Example:
>>> from pythainlp.ulmfit import replace_rep_nonum
>>>
>>> text = "กาาาาาาา"
>>> replace_rep_nonum(text)
'กา xxrep '
pythainlp.ulmfit.replace_wrep_post(toks: Collection[str]) List[str][source]

Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.

Parameters:

toks (list[str]) – list of tokens

Returns:

list of tokens where the xxwrep token and the repetition count are added in front of repeated words.

Return type:

list[str]

Example:
>>> from pythainlp.ulmfit import replace_wrep_post
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'xxwrep', '3', 'น้ำ']
pythainlp.ulmfit.replace_wrep_post_nonum(toks: Collection[str]) List[str][source]

Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.

Parameters:

toks (list[str]) – list of tokens

Returns:

list of tokens where the xxwrep token is added in front of repeated words.

Return type:

list[str]

Example:
>>> from pythainlp.ulmfit import replace_wrep_post_nonum
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post_nonum(toks)
['กา', 'xxwrep', 'น้ำ']
pythainlp.ulmfit.spec_add_spaces(text: str) str[source]

Add spaces around / and # in text. (code from fastai)
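
Example (a minimal sketch; each / and # gets a space on both sides):

>>> from pythainlp.ulmfit import spec_add_spaces
>>> spec_add_spaces("ราคา#20/30")
'ราคา # 20 / 30'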

pythainlp.ulmfit.ungroup_emoji(toks: Collection[str]) List[str][source]

Ungroup Zero Width Joiner (ZWJ) emojis.

See https://emojipedia.org/emoji-zwj-sequence/
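
Example (a minimal sketch; the expected output assumes a token made up entirely of emoji is split into one token per emoji, while other tokens pass through unchanged):

>>> from pythainlp.ulmfit import ungroup_emoji
>>> ungroup_emoji(['ไทย', '😂😂'])
['ไทย', '😂', '😂']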
