Interactive online version: Binder, Google Colab

Wangchanberta

This notebook demonstrates pythainlp.wangchanberta.

Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. arXiv preprint arXiv:2101.09635. 2021 Jan 24.

[1]:
#!pip install pythainlp[full]
Collecting https://github.com/PyThaiNLP/pythainlp/archive/add-ner-thai2transformers.zip
  Downloading https://github.com/PyThaiNLP/pythainlp/archive/add-ner-thai2transformers.zip
     \ 12.6MB 1.8MB/s
Collecting python-crfsuite>=0.9.6
  Downloading https://files.pythonhosted.org/packages/79/47/58f16c46506139f17de4630dbcfb877ce41a6355a1bbf3c443edb9708429/python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743kB)
     |████████████████████████████████| 747kB 8.0MB/s
Requirement already satisfied: requests>=2.22.0 in /usr/local/lib/python3.7/dist-packages (from pythainlp==2.3.0.dev0) (2.23.0)
Collecting tinydb>=3.0
  Downloading https://files.pythonhosted.org/packages/af/cd/1ce3d93818cdeda0446b8033d21e5f32daeb3a866bbafd878a9a62058a9c/tinydb-4.4.0-py3-none-any.whl
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.22.0->pythainlp==2.3.0.dev0) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.22.0->pythainlp==2.3.0.dev0) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.22.0->pythainlp==2.3.0.dev0) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.22.0->pythainlp==2.3.0.dev0) (2020.12.5)
Building wheels for collected packages: pythainlp
  Building wheel for pythainlp (setup.py) ... done
  Created wheel for pythainlp: filename=pythainlp-2.3.0.dev0-cp37-none-any.whl size=11006400 sha256=f89b594cbbebbc1940c16b0957a74182f2ea8169de8270e33f0c6bac5d1d4fcd
  Stored in directory: /root/.cache/pip/wheels/9a/be/9e/b2ab1db5c70b14b8d5d8a402e36ed915c2ec906df5c4f4b089
Successfully built pythainlp
Installing collected packages: python-crfsuite, tinydb, pythainlp
Successfully installed pythainlp-2.3.0.dev0 python-crfsuite-0.9.7 tinydb-4.4.0
[2]:
#!pip install transformers sentencepiece
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
     |████████████████████████████████| 1.9MB 8.6MB/s
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
     |████████████████████████████████| 1.2MB 38.4MB/s
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers) (20.9)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 34.1MB/s
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.0.12)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 42.3MB/s
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.41.1)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from transformers) (3.7.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers) (2.4.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2020.12.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.15.0)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.0.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < "3.8"->transformers) (3.4.1)
Requirement already satisfied: typing-extensions>=3.6.4; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < "3.8"->transformers) (3.7.4.3)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=26dd1871c98e4cd5fe1938dbeba7086606c31e80a945ec9f752859e252fe7068
  Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45
Successfully built sacremoses
Installing collected packages: tokenizers, sacremoses, transformers, sentencepiece
Successfully installed sacremoses-0.0.43 sentencepiece-0.1.95 tokenizers-0.10.1 transformers-4.3.3
[3]:
from pythainlp.wangchanberta import ThaiNameTagger, pos_tag






Named Entity Recognition

Supported datasets:

  • thainer

  • lst20

[4]:
t = ThaiNameTagger(dataset_name="thainer")


[5]:
t.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=True)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[5]:
'ทดสอบผมมีชื่อว่า <PERSON>นายวรรณพงษ์  ภัททิยไพบูลย์</PERSON>'
[6]:
t.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=False)
[6]:
[('ทดสอบผมมีชื่อว่า ', 'O'), ('นายวรรณพงษ์  ภัททิยไพบูลย์', 'B-PERSON')]
[7]:
t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
[7]:
[('โรงเรียน', 'B-ORGANIZATION'),
 ('สวนกุหลาบ', 'I-ORGANIZATION'),
 ('เป็นโรงเรียนที่ดี  แต่ไม่มีสวนกุหลาบ', 'O')]
[8]:
t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=True)
[8]:
'<ORGANIZATION>โรงเรียนสวนกุหลาบ</ORGANIZATION>เป็นโรงเรียนที่ดี  แต่ไม่มีสวนกุหลาบ'
[9]:
t2 = ThaiNameTagger(dataset_name="lst20", grouped_entities=True)


[10]:
t2.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=True)
[10]:
'ทดสอบผมมีชื่อว่า <TTL>นาย</TTL><PER>วรรณพงษ์ ภัททิยไพบูลย์</PER>'
[11]:
t2.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=False)
[11]:
[('ทดสอบผมมีชื่อว่า ', 'O'),
 ('นาย', 'B-TTL'),
 ('วรรณพงษ์', 'B-PER'),
 (' ', 'I-PER'),
 ('ภัททิยไพบูลย์', 'I-PER')]
[12]:
t2.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
[12]:
[('โรงเรียนสวนกุหลาบ', 'B-ORG'),
 ('เป็นโรงเรียนที่ดี  แต่ไม่มี', 'O'),
 ('สวนกุหลาบ', 'B-ORG')]
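
For downstream use, the tuple output above can be folded back into entity spans. Below is a minimal sketch in plain Python (independent of the model, so the names `extract_entities` and `tagged` are illustrative, not part of the PyThaiNLP API) that merges BIO-style `(text, tag)` tuples:

```python
def extract_entities(tagged):
    """Merge BIO-tagged (text, tag) tuples into (entity_text, label) spans."""
    entities = []
    for text, tag in tagged:
        if tag.startswith("B-"):
            # Start of a new entity: strip the "B-" prefix to get the label.
            entities.append((text, tag[2:]))
        elif tag.startswith("I-") and entities:
            # Continuation: append the text to the most recent entity.
            prev_text, label = entities[-1]
            entities[-1] = (prev_text + text, label)
    return entities

# Tuples in the shape returned by t2.get_ner(..., tag=False) above
tagged = [('โรงเรียนสวนกุหลาบ', 'B-ORG'),
          ('เป็นโรงเรียนที่ดี แต่ไม่มี', 'O'),
          ('สวนกุหลาบ', 'B-ORG')]
print(extract_entities(tagged))
# [('โรงเรียนสวนกุหลาบ', 'ORG'), ('สวนกุหลาบ', 'ORG')]
```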

Part of Speech

It uses the lst20 dataset.

[13]:
pos_tag("ผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์")
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[13]:
[('ผม', 'PR'),
 ('มีชื่อว่า', 'NN'),
 (' ', 'NN'),
 ('นาย', 'NN'),
 ('วรรณ', 'NN'),
 ('พงษ์', 'NN'),
 (' ', 'NN'),
 ('ภั', 'NN'),
 ('ท', 'NN'),
 ('ทิ', 'NN'),
 ('ย', 'NN'),
 ('ไพบูลย์', 'NN')]
[14]:
pos_tag("ผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",grouped_word=True)
[14]:
[('ผม', 'PR'), ('มีชื่อว่า  นายวรรณพงษ์  ภัททิยไพบูลย์', 'NN')]
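
The `grouped_word=True` behavior above can be approximated by merging consecutive tokens that share a tag. The sketch below is a plausible reimplementation for illustration only, not necessarily PyThaiNLP's internal logic (the helper name `group_by_tag` is made up):

```python
def group_by_tag(tagged):
    """Merge consecutive (token, tag) pairs that share the same POS tag."""
    grouped = []
    for text, tag in tagged:
        if grouped and grouped[-1][1] == tag:
            # Same tag as the previous run: concatenate the token text.
            grouped[-1] = (grouped[-1][0] + text, tag)
        else:
            grouped.append((text, tag))
    return grouped

# Per-token output in the shape returned by pos_tag(...) above
tokens = [('ผม', 'PR'), ('มีชื่อว่า', 'NN'), (' ', 'NN'), ('นาย', 'NN')]
print(group_by_tag(tokens))
# [('ผม', 'PR'), ('มีชื่อว่า นาย', 'NN')]
```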

Subword

[15]:
from pythainlp.tokenize import subword_tokenize
[16]:
subword_tokenize("ทดสอบตัดคำย่อย", engine="wangchanberta")
[16]:
['▁', 'ทดสอบ', 'ตัด', 'คํา', 'ย่อย']
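
The leading `▁` (U+2581, LOWER ONE EIGHTH BLOCK) is the SentencePiece word-boundary marker, not a literal character of the input. To recover plain text from the pieces, join them and map the marker back to a space. A minimal sketch (the helper name `detokenize` is illustrative, not a PyThaiNLP function):

```python
def detokenize(subwords):
    """Rejoin SentencePiece subwords: '▁' (U+2581) marks a word boundary."""
    return "".join(subwords).replace("\u2581", " ").strip()

pieces = ['▁', 'ทดสอบ', 'ตัด', 'คํา', 'ย่อย']
print(detokenize(pieces))
# ทดสอบตัดคํายอ่ย-style round-trip: the original string is recovered
```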