pythainlp.ulmfit

The pythainlp.ulmfit.utils module provides utilities for the ULMFit model.

Modules

pythainlp.ulmfit.utils.get_texts(df)[source]
get_texts gets a tuple of tokenized texts and labels.

Parameters

df (pandas.DataFrame) – pandas.DataFrame with labels in the first column and texts in the second column

Returns

  • tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list

  • labels - list of labels
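
A minimal usage sketch (the sample DataFrame and the Thai strings are illustrative; the exact tokens depend on the tokenizer):

>>> import pandas as pd
>>> from pythainlp.ulmfit.utils import get_texts
>>> df = pd.DataFrame({'label': [0, 1],
...                    'text': ['สวัสดีครับ', 'ขอบคุณมาก']})
>>> tok, labels = get_texts(df)
>>> tok[0][0]
'xbos'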

pythainlp.ulmfit.utils.get_all(df)[source]
get_all iterates get_texts over the entire pandas.DataFrame.

Parameters

df (pandas.DataFrame) – pandas.DataFrame with labels in the first column and texts in the second column

Returns

  • tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list

  • labels - list of labels
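
get_all shares get_texts' interface, so the sketch above applies unchanged:

>>> from pythainlp.ulmfit.utils import get_all
>>> tok, labels = get_all(df)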

pythainlp.ulmfit.utils.numericalizer(df, itos=None, max_vocab=60000, min_freq=2, pad_tok='_pad_', unk_tok='_unk_')[source]
numericalizer numericalizes tokenized texts:

  • keeps tokens whose word frequency is greater than min_freq

  • caps the vocabulary at max_vocab words

  • adds the unknown token _unk_ and the padding token _pad_ in the first and second positions

  • uses the integer-to-string list itos if available, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...]

Parameters
  • df (pandas.DataFrame) – pandas.DataFrame with labels in the first column and texts in the second column

  • itos (list) – integer-to-string list

  • max_vocab (int) – maximum vocabulary size (default 60000)

  • min_freq (int) – minimum word frequency to be included (default 2)

  • pad_tok (str) – padding token

  • unk_tok (str) – unknown token

Returns

  • lm - numpy.array of numericalized texts

  • tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list

  • labels - list of labels

  • itos - integer-to-string list e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...]

  • stoi - string-to-integer dict e.g. {'_unk_': 0, '_pad_': 1, 'first_word': 2, 'second_word': 3, ...}

  • freq - collections.Counter for word frequency
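
Continuing the sketch from get_texts above, the six return values come back in the order listed (the concrete vocabulary is illustrative, but the first two itos entries are fixed by the default tokens):

>>> from pythainlp.ulmfit.utils import numericalizer
>>> lm, tok, labels, itos, stoi, freq = numericalizer(df)
>>> itos[:2]
['_unk_', '_pad_']
>>> stoi['_pad_']
1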

pythainlp.ulmfit.utils.merge_wgts(em_sz, wgts, itos_pre, itos_cls)[source]
merge_wgts merges the pretrained model's weights with the vocabulary of the current dataset.

Parameters

  • em_sz (int) – size of embedding vectors (the pretrained model uses 300)

  • wgts – saved PyTorch weights of the pretrained model

  • itos_pre (list) – integer-to-string list of pretrained model

  • itos_cls (list) – integer-to-string list of current dataset

Returns

merged weights of the model for current dataset
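
merge_wgts adapts pretrained weights to a new vocabulary in the usual ULMFit way: embedding rows for words shared between itos_pre and itos_cls are copied from the pretrained matrix, and words new to the dataset get a default row. A minimal NumPy sketch of that idea (merge_emb_sketch, pre_emb, and the mean-embedding fallback are illustrative assumptions, not the library's exact implementation):

import numpy as np

def merge_emb_sketch(em_sz, pre_emb, itos_pre, itos_cls):
    """Illustrative only: copy shared rows, mean-fill the rest."""
    assert pre_emb.shape[1] == em_sz
    # Map each pretrained word to its embedding row.
    stoi_pre = {w: i for i, w in enumerate(itos_pre)}
    # Start every row of the new matrix at the mean pretrained embedding.
    new_emb = np.tile(pre_emb.mean(axis=0), (len(itos_cls), 1))
    # Copy the pretrained row for every word the two vocabularies share.
    for i, w in enumerate(itos_cls):
        if w in stoi_pre:
            new_emb[i] = pre_emb[stoi_pre[w]]
    return new_emb  # shape: (len(itos_cls), em_sz)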

pythainlp.ulmfit.utils.document_vector(ss, m, stoi, tok_engine='newmm')[source]
document_vector gets a document vector using a pretrained ULMFit model.

Parameters
  • ss (str) – sentence from which to extract embeddings

  • m – PyTorch model

  • stoi (dict) – string-to-integer dict e.g. {'_unk_': 0, '_pad_': 1, 'first_word': 2, 'second_word': 3, ...}

  • tok_engine (str) – tokenization engine (newmm is recommended when using the pretrained ULMFit model)

Returns

numpy.array of the document vector, size 300
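
A hedged usage sketch; loading the pretrained model is outside this module's scope, so m and stoi are assumed to exist already, and the Thai sentence is illustrative:

>>> from pythainlp.ulmfit.utils import document_vector
>>> vec = document_vector('วันนี้อากาศดี', m, stoi)  # m, stoi loaded elsewhere
>>> vec.shape
(300,)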

pythainlp.ulmfit.utils.about()[source]
class pythainlp.ulmfit.utils.ThaiTokenizer(engine='newmm')[source]
static proc_all(ss)[source]
proc_all runs proc_text on multiple texts.

Parameters

ss (list) – texts to process

Returns

processed and tokenized texts

static proc_all_mp(ss)[source]
proc_all_mp runs proc_text on multiple texts using multiple CPUs.

Parameters

ss (list) – texts to process

Returns

processed and tokenized texts
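
Both helpers take a collection of texts and apply proc_text to each; proc_all_mp simply spreads the same work across multiple CPUs. A usage sketch (the Thai strings are illustrative):

>>> from pythainlp.ulmfit.utils import ThaiTokenizer
>>> ThaiTokenizer.proc_all(['สวัสดีครับ', 'ขอบคุณครับ'])  # returns processed, tokenized texts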

proc_text(text)[source]
proc_text processes and tokenizes text, removing repetitions, special characters, and double spaces.

Parameters

text (str) – text to process

Returns

processed and tokenized text
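
A usage sketch; the exact tokens depend on the engine, so no output is shown:

>>> tt = ThaiTokenizer()
>>> tt.proc_text('ดีมากกกกก   จริง')  # repetitions and double spaces are cleaned up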

static replace_rep(text)[source]

replace_rep() replaces a run of 3 or more repeated characters with tkrep followed by the number of repetitions.

Parameters

text (str) – text to process

Returns

processed text where repetitions are replaced by tkrep followed by the number of repetitions

Example:

>>> from pythainlp.ulmfit.utils import ThaiTokenizer
>>> tt = ThaiTokenizer()
>>> tt.replace_rep('คือดียยยยยย')
คือดีtkrep6ย
sub_br(text)[source]

sub_br() replaces <br> tags with a newline (\n).

Parameters

text (str) – text to process

Returns

processed text
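
A usage sketch based on the description above (which <br> variants are matched is an assumption):

>>> tt = ThaiTokenizer()
>>> tt.sub_br('บรรทัดแรก<br />บรรทัดที่สอง')
'บรรทัดแรก\nบรรทัดที่สอง'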

tokenize(text)[source]
tokenize tokenizes text with the selected engine.

Parameters

text (str) – text to tokenize

Returns

tokenized text
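
A usage sketch (the exact segmentation depends on the engine):

>>> tt = ThaiTokenizer(engine='newmm')
>>> tt.tokenize('ตัดคำภาษาไทย')  # returns a list of tokens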