pythainlp.benchmarks

The pythainlp.benchmarks module contains utility functions for benchmarking tasks related to Thai NLP. At the moment, only word tokenization is supported. Other tasks will be added soon.

Modules

Tokenization

Quality

Figure: Qualitative evaluation of word tokenization.

pythainlp.benchmarks.word_tokenization.compute_stats(ref_sample: str, raw_sample: str) → dict

Compute statistics for tokenization quality.

These statistics include:

Character-Level:

True Positive, False Positive, True Negative, False Negative, Precision, Recall, and F1

Word-Level:

Precision, Recall, and F1

Other:
  • Correct tokenization indicator: a {0, 1} sequence indicating whether the corresponding word is tokenized correctly.

Parameters:
  • ref_sample (str) – ground truth sample

  • raw_sample (str) – sample that we want to evaluate

Returns:

character- and word-level metrics, along with the correct-tokenization indicator

Return type:

dict[str, float | str]
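
A minimal usage sketch with hypothetical sample strings, assuming token boundaries are marked with the pipe character ("|") as in PyThaiNLP's benchmark data:

    from pythainlp.benchmarks.word_tokenization import compute_stats

    ref = "ผม|ชอบ|กิน|ข้าว"  # ground-truth segmentation
    raw = "ผม|ชอบกิน|ข้าว"   # tokenizer output to evaluate

    stats = compute_stats(ref, raw)
    print(stats)  # dict of character- and word-level metrics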

pythainlp.benchmarks.word_tokenization.benchmark(ref_samples: List[str], samples: List[str]) → DataFrame

Performance benchmark of samples.

Please see pythainlp.benchmarks.word_tokenization.compute_stats() for the metrics being computed.

Parameters:
  • ref_samples (list[str]) – ground truth samples

  • samples (list[str]) – samples that we want to evaluate

Returns:

DataFrame with rows × columns = len(samples) × len(metrics)

Return type:

pandas.DataFrame
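
A sketch of benchmarking several samples at once (hypothetical data, in the same pipe-delimited format assumed above):

    from pythainlp.benchmarks.word_tokenization import benchmark

    refs = ["ผม|ชอบ|กิน|ข้าว", "วัน|นี้|อากาศ|ดี"]  # ground truth
    outs = ["ผม|ชอบกิน|ข้าว", "วันนี้|อากาศ|ดี"]    # tokenizer outputs

    df = benchmark(refs, outs)         # one row per sample, one column per metric
    print(df.mean(numeric_only=True))  # average each metric over all samples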

pythainlp.benchmarks.word_tokenization.preprocessing(txt: str, remove_space: bool = True) → str

Clean up text before performing evaluation.

Parameters:
  • txt (str) – text to be preprocessed

  • remove_space (bool) – whether to remove whitespace

Returns:

preprocessed text

Return type:

str
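
For instance, a minimal sketch with a hypothetical input string (the exact cleanup steps depend on the implementation):

    from pythainlp.benchmarks.word_tokenization import preprocessing

    # with remove_space=True, whitespace is stripped before evaluation
    cleaned = preprocessing("ผม ชอบ กิน ข้าว", remove_space=True)
    print(cleaned)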