This version is released on the Hugging Face Hub. The model was trained from the WangchanBERTa base model.

Training script and split data: https://zenodo.org/record/7761354

Dataset

Size

  • Train: 3,938 docs
  • Validation: 1,313 docs
  • Test: 1,313 docs
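
As a quick check on the split sizes above, the following minimal sketch computes the total number of documents and the share of each split. The variable names are illustrative only and are not part of the released training script.

```python
# Document counts copied from the size list above.
split_sizes = {"train": 3938, "validation": 1313, "test": 1313}

total_docs = sum(split_sizes.values())

# Fraction of the whole corpus that falls in each split.
split_fractions = {name: count / total_docs for name, count in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_sizes[name]} docs ({fraction:.1%})")
print(f"total: {total_docs} docs")
```

The counts work out to roughly a 60/20/20 split between train, validation, and test.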

Some of the data were crowdsourced between Dec 2018 and Nov 2019: https://github.com/wannaphong/thai-ner

Domain

  • News (IT, politics, economy, social)
  • PR (KKU news)
  • General

Source

  • Some data come from Nutcha’s thesis (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip); I improved them by rechecking the annotations and adding more tags.
  • Twitter
  • Blognone.com - IT news
  • thaigov.go.th
  • kku.ac.th

And more (the full list of sources has been lost).

Tag

  • DATE - date
  • TIME - time
  • EMAIL - email
  • LEN - length
  • LOCATION - Location
  • ORGANIZATION - Company / Organization
  • PERSON - Person name
  • PHONE - phone number
  • TEMPERATURE - temperature
  • URL - URL
  • ZIP - Zip code
  • MONEY - amount of money
  • LAW - legislation
  • PERCENT - percentage
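
Token-classification models usually need these entity types expanded into per-token labels. The sketch below shows one common way to do that. It assumes a BIO scheme (B- for the first token of an entity, I- for the following tokens, O for everything else), which this page does not state, and it writes the date tag as DATE, which is also an assumption. The helper name `build_bio_labels` is hypothetical. Check the label names in the released data before relying on this mapping.

```python
# Entity types copied from the tag list above.
# Assumption: the date tag is spelled DATE in the data.
ENTITY_TYPES = [
    "DATE", "TIME", "EMAIL", "LEN", "LOCATION", "ORGANIZATION", "PERSON",
    "PHONE", "TEMPERATURE", "URL", "ZIP", "MONEY", "LAW", "PERCENT",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given types (assumed BIO scheme)."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)

# Lookup tables in the form most token-classification toolkits expect.
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}

print(len(labels), "labels")
```

With 14 entity types, this scheme gives 29 labels: two per type plus the O label.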

Download: HuggingFace Hub

Model

The model was trained from the WangchanBERTa base model.

Results on the validation set

  • Precision: 0.830336794125095
  • Recall: 0.873701039168665
  • F1: 0.8514671513892494
  • Accuracy: 0.9736483416628805

Results on the test set

  • Precision: 0.8199168093956447
  • Recall: 0.8781446540880503
  • F1: 0.8480323927622422
  • Accuracy: 0.9724346779516247
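
The reported F1 scores can be cross-checked against the reported precision and recall, because F1 is the harmonic mean of the two. The sketch below is a sanity check on the published numbers only. It is not the evaluation code from the training script, and the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Metrics copied from the two result lists above.
reported = {
    "validation": {
        "precision": 0.830336794125095,
        "recall": 0.873701039168665,
        "f1": 0.8514671513892494,
    },
    "test": {
        "precision": 0.8199168093956447,
        "recall": 0.8781446540880503,
        "f1": 0.8480323927622422,
    },
}

for split, metrics in reported.items():
    recomputed = f1_score(metrics["precision"], metrics["recall"])
    difference = abs(recomputed - metrics["f1"])
    print(f"{split}: recomputed F1 = {recomputed:.6f}, difference = {difference:.2e}")
```

For both splits the recomputed value agrees with the reported F1 to well within rounding error.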

Download: HuggingFace Hub

Cite

Wannaphong Phatthiyaphaibun. (2022). Thai NER 2.0 (2.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7761354

or BibTeX

@dataset{wannaphong_phatthiyaphaibun_2022_7761354,
  author       = {Wannaphong Phatthiyaphaibun},
  title        = {Thai NER 2.0},
  month        = sep,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {2.0},
  doi          = {10.5281/zenodo.7761354},
  url          = {https://doi.org/10.5281/zenodo.7761354}
}