pythainlp.ulmfit

Modules

class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]

Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer. (see: https://docs.fast.ai/text.transform#BaseTokenizer)
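fastai's BaseTokenizer interface expects a tokenizer(t) method that returns a list of tokens. A minimal pure-Python sketch of that wrapper pattern (a whitespace split stands in for the frozen newmm tokenizer; the class and helper names here are illustrative, not pythainlp's actual implementation):

```python
from typing import Callable, List

class TokenizerWrapper:
    """Sketch of a BaseTokenizer-style wrapper around a tokenize function."""

    def __init__(self, tok_func: Callable[[str], List[str]], lang: str = "th"):
        self.lang = lang
        self._tok = tok_func  # the real class delegates to the newmm engine

    def tokenizer(self, t: str) -> List[str]:
        # fastai calls this method on each text to be tokenized
        return self._tok(t)

wrapper = TokenizerWrapper(str.split)
print(wrapper.tokenizer("hello world"))  # ['hello', 'world']
```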

pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]

This function vectorizes Thai input text into a 400-dimensional vector, using a fastai language model and data bunch.

Meth

document_vector gets a document vector using a fastai language model and data bunch

Parameters
  • text (str) – text to be vectorized with fastai language model.

  • learn – fastai language model learner

  • data – fastai data bunch

  • agg (str) – name of the aggregation method for word embeddings. The available methods are “mean” and “sum”.

Returns

numpy.array of document vector sized 400 based on the encoder of the model

Return type

numpy.ndarray((1, 400))

Example
>>> from pythainlp.ulmfit import document_vector
>>> from fastai import *
>>> from fastai.text import *
>>>
>>> # Load Data Bunch
>>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
>>>
>>> # Initialize language_model_learner
>>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
     qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
     hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
>>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
>>> learn = language_model_learner(data, AWD_LSTM, config=config,
                                   pretrained=False, **trn_args)
>>> document_vector('วันนี้วันดีปีใหม่', learn, data)
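The agg step reduces per-token embeddings to a single (1, 400) document vector. A minimal numpy sketch of the “mean” and “sum” aggregations over a toy embedding matrix (the random matrix stands in for the encoder output; this is not the fastai encoder itself):

```python
import numpy as np

# Toy stand-in for the encoder output: 3 tokens, 400-dim embeddings each.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(3, 400))

def aggregate(embs: np.ndarray, agg: str = "mean") -> np.ndarray:
    # Reduce (n_tokens, emb_sz) to (1, emb_sz), mirroring document_vector's agg options.
    if agg == "mean":
        return embs.mean(axis=0, keepdims=True)
    if agg == "sum":
        return embs.sum(axis=0, keepdims=True)
    raise ValueError("agg must be 'mean' or 'sum'")

print(aggregate(token_embs, "mean").shape)  # (1, 400)
```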
See Also
  • A notebook showing how to train a ULMFit language model and how to use it, Jupyter Notebook

pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]

This function inserts new vocab into an existing model’s weights (wgts) and initializes the weights for vocab not found in the pretrained vocab with the average embedding.

Meth

merge_wgts inserts pretrained weights and vocab into a new set of weights and vocab; the average embedding is used for vocab not present in the pretrained vocab

Parameters
  • em_sz (int) – embedding size

  • wgts – torch model weights

  • itos_pre (list) – pretrained list of vocab

  • itos_new (list) – list of new vocab

Returns

merged torch model weights

pythainlp.ulmfit.process_thai(text: str, pre_rules: Collection = [fix_html, reorder_vowels, spec_add_spaces, rm_useless_spaces, rm_useless_newlines, rm_brackets, replace_url, replace_rep_nonum], tok_func: Callable = Tokenizer.word_tokenize, post_rules: Collection = [ungroup_emoji, lowercase_all, replace_wrep_post_nonum, remove_space]) → Collection[str][source]

Process Thai texts for models (with sparse features as default)

Parameters
  • text (str) – text to be cleaned

  • pre_rules (list[func]) – rules to apply before tokenization.

  • tok_func (func) – tokenization function (by default, tok_func is pythainlp.tokenize.word_tokenize())

  • post_rules (list[func]) – rules to apply after tokenization

Returns

a list of cleaned tokenized texts

Return type

list[str]

Note
  • The default pre-rules consist of fix_html(), reorder_vowels(), spec_add_spaces(), rm_useless_spaces(), rm_useless_newlines(), rm_brackets(), replace_url() and replace_rep_nonum().

  • The default post-rules consist of ungroup_emoji(), lowercase_all(), replace_wrep_post_nonum(), and remove_space().

Example
  1. Use default pre-rules and post-rules:

>>> from pythainlp.ulmfit import process_thai
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp;     "
>>> process_thai(text)
['บ้าน', 'xxrep', '   ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣',
'😃', '😄', '😅', 'pythainlp', '&']
  2. Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:

>>> from pythainlp.ulmfit import (
    process_thai,
    replace_rep_after,
    fix_html,
    ungroup_emoji,
    replace_wrep_post,
    remove_space)
>>>
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp;     "
>>> process_thai(text,
                 pre_rules=[replace_rep_after, fix_html],
                 post_rules=[ungroup_emoji,
                             replace_wrep_post,
                             remove_space]
                )
['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣',
 '😃', '😄', '😅', 'PyThaiNLP', '&']
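Conceptually, process_thai chains its arguments as pre_rules → tok_func → post_rules: string-level rules run before tokenization, token-level rules after. A minimal pure-Python sketch of that composition (the rule functions below are hypothetical stand-ins, not pythainlp's actual rules):

```python
from typing import Callable, Iterable, List

def process(text: str,
            pre_rules: Iterable[Callable[[str], str]],
            tok_func: Callable[[str], List[str]],
            post_rules: Iterable[Callable[[List[str]], List[str]]]) -> List[str]:
    # 1) Apply string-level rules in order.
    for rule in pre_rules:
        text = rule(text)
    # 2) Tokenize the cleaned string.
    toks = tok_func(text)
    # 3) Apply token-level rules in order.
    for rule in post_rules:
        toks = rule(toks)
    return toks

# Hypothetical stand-in rules:
collapse_spaces = lambda s: " ".join(s.split())
lowercase_all = lambda toks: [t.lower() for t in toks]

print(process("Hello   THERE", [collapse_spaces], str.split, [lowercase_all]))
# ['hello', 'there']
```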