pythainlp.benchmarks

The pythainlp.benchmarks module contains utility functions for benchmarking tasks related to Thai NLP. At the moment, only word tokenization is supported. Other tasks will be added soon.

Modules

Tokenization

Quality

[Figure: Qualitative evaluation of word tokenization]

pythainlp.benchmarks.word_tokenization.compute_stats(ref_sample: str, raw_sample: str) → dict

Compute statistics for tokenization quality.

These statistics include:

Character-Level:

True Positives, False Positives, True Negatives, False Negatives, Precision, Recall, and F1

Word-Level:

Precision, Recall, and F1

Other:
  • Correct tokenization indicator: a {0, 1} sequence indicating whether the corresponding word is tokenized correctly.

Parameters
  • ref_sample (str) – ground truth sample

  • raw_sample (str) – sample that we want to evaluate

Returns

character-level and word-level metrics, together with the correctly tokenized word indicators

Return type

dict[str, float | str]
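A minimal usage sketch. Note that the "|"-delimited input format is an assumption based on common tokenization-benchmark conventions and is not stated on this page; verify it against your data:

>>> from pythainlp.benchmarks.word_tokenization import compute_stats
>>> # assumption: "|" marks word boundaries in both samples
>>> ref = "ผม|ไม่|ชอบ|กิน|ผัก"  # ground-truth segmentation
>>> raw = "ผม|ไม่|ชอบ|กินผัก"   # tokenizer output under evaluation
>>> stats = compute_stats(ref, raw)
>>> # stats is a dict holding the character-level and word-level
>>> # metrics described above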

pythainlp.benchmarks.word_tokenization.benchmark(ref_samples: List[str], samples: List[str]) → DataFrame

Performance benchmark of samples.

Please see pythainlp.benchmarks.word_tokenization.compute_stats() for the metrics being computed.

Parameters
  • ref_samples (list[str]) – ground truth samples

  • samples (list[str]) – samples that we want to evaluate

Returns

a DataFrame with len(samples) rows and len(metrics) columns

Return type

pandas.DataFrame
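A hedged sketch of a typical call, again assuming "|"-delimited tokens (an assumption, as above):

>>> from pythainlp.benchmarks.word_tokenization import benchmark
>>> ref_samples = ["ผม|ไม่|ชอบ|กิน|ผัก", "ฉัน|ชอบ|กิน|ผลไม้"]
>>> samples = ["ผม|ไม่|ชอบ|กินผัก", "ฉัน|ชอบกิน|ผลไม้"]
>>> df = benchmark(ref_samples, samples)
>>> # one row per sample; columns are the metrics from compute_stats()
>>> df.describe()  # summarize metric distributions across samples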

pythainlp.benchmarks.word_tokenization.preprocessing(txt: str, remove_space: bool = True) → str

Clean up text before performing evaluation.

Parameters
  • txt (str) – text to be preprocessed

  • remove_space (bool) – whether to remove whitespace

Returns

preprocessed text

Return type

str
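A small illustration; the exact cleanup rules applied (beyond whitespace removal) are implementation details not documented on this page:

>>> from pythainlp.benchmarks.word_tokenization import preprocessing
>>> cleaned = preprocessing("ผม ไม่ ชอบ กิน ผัก", remove_space=True)
>>> # cleaned is the normalized text, with whitespace removed
>>> # because remove_space=True (the default)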