pythainlp.benchmarks

The pythainlp.benchmarks module contains utility functions for benchmarking tasks related to Thai NLP. At the moment, only word tokenization is supported. Other tasks will be added soon.

Modules

Tokenization

Quality

[Figure: Qualitative evaluation of word tokenization]

pythainlp.benchmarks.word_tokenization.compute_stats(ref_sample: str, raw_sample: str) → dict

Compute statistics for tokenization quality.

These statistics include:

Character-Level:

True Positives, False Positives, True Negatives, False Negatives, Precision, Recall, and F1

Word-Level:

Precision, Recall, and F1

Other:
  • Correct tokenization indicator: a {0, 1} sequence indicating whether the corresponding word is tokenized correctly.

Parameters
  • ref_sample (str) – ground truth sample

  • raw_sample (str) – sample that we want to evaluate

Returns

character-level and word-level metrics, together with the correctly tokenized word indicators

Return type

dict[str, float | str]
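A minimal usage sketch. Note that the "|"-delimited input format is an assumption based on common tokenization-benchmark conventions and is not stated on this page; verify it against your data:

>>> from pythainlp.benchmarks.word_tokenization import compute_stats
>>> # assumption: "|" marks word boundaries in both samples
>>> ref = "ผม|ไม่|ชอบ|กิน|ผัก"  # ground-truth segmentation
>>> raw = "ผม|ไม่|ชอบ|กินผัก"   # tokenizer output under evaluation
>>> stats = compute_stats(ref, raw)
>>> # stats is a dict holding the character-level and word-level
>>> # metrics described above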

pythainlp.benchmarks.word_tokenization.benchmark(ref_samples: List[str], samples: List[str]) → DataFrame

Performance benchmark of samples.

Please see pythainlp.benchmarks.word_tokenization.compute_stats() for the metrics being computed.

Parameters
  • ref_samples (list[str]) – ground truth samples

  • samples (list[str]) – samples that we want to evaluate

Returns

a DataFrame with len(samples) rows and len(metrics) columns

Return type

pandas.DataFrame
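A hedged sketch of a typical call, again assuming "|"-delimited tokens (an assumption, as above):

>>> from pythainlp.benchmarks.word_tokenization import benchmark
>>> ref_samples = ["ผม|ไม่|ชอบ|กิน|ผัก", "ฉัน|ชอบ|กิน|ผลไม้"]
>>> samples = ["ผม|ไม่|ชอบ|กินผัก", "ฉัน|ชอบกิน|ผลไม้"]
>>> df = benchmark(ref_samples, samples)
>>> # one row per sample; columns are the metrics from compute_stats()
>>> df.describe()  # summarize metric distributions across samples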

pythainlp.benchmarks.word_tokenization.preprocessing(txt: str, remove_space: bool = True) → str

Clean up text before performing evaluation.

Parameters
  • txt (str) – text to be preprocessed

  • remove_space (bool) – whether to remove whitespace

Returns

preprocessed text

Return type

str
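A small illustration; the exact cleanup rules applied (beyond whitespace removal) are implementation details not documented on this page:

>>> from pythainlp.benchmarks.word_tokenization import preprocessing
>>> cleaned = preprocessing("ผม ไม่ ชอบ กิน ผัก", remove_space=True)
>>> # cleaned is the normalized text, with whitespace removed
>>> # because remove_space=True (the default)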