
Thai Chunk Parser

In PyThaiNLP, we use chunk data from the ORCHID++ corpus.

Read more: https://github.com/PyThaiNLP/pythainlp/pull/524

[5]:
!pip install pythainlp svgling nltk python-crfsuite
Successfully installed python-crfsuite-0.9.9
[1]:
from pythainlp.tokenize import word_tokenize
from pythainlp.tag import pos_tag
from pythainlp.tag import chunk_parse
from nltk.chunk import conlltags2tree
import svgling
[2]:
def test(txt):
    # Tokenize, then POS-tag with the perceptron engine and the ORCHID tagset
    words_pos = pos_tag(word_tokenize(txt), engine="perceptron", corpus="orchid")
    # Predict an IOB chunk tag for each (word, pos) pair
    tags = chunk_parse(words_pos)
    # Combine into (word, pos, chunk-tag) triples for conlltags2tree
    return [(word, pos, chunk) for (word, pos), chunk in zip(words_pos, tags)]
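chunk_parse returns IOB-style tags (B-X begins a chunk of type X, I-X continues it, O is outside any chunk), which is why the triples can be fed to conlltags2tree. As a minimal sketch of how that scheme groups tokens, here is a small stdlib-only helper; the triples below are hypothetical example tags for "แมวกินปลา" ("cat eats fish"), not actual model output:

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_label, [words]) spans."""
    chunks = []
    for word, pos, iob in triples:
        if iob.startswith("I-") and chunks and chunks[-1][0] == iob[2:]:
            chunks[-1][1].append(word)  # continue the current chunk
        elif iob.startswith(("B-", "I-")):
            chunks.append((iob[2:], [word]))  # start a new chunk
        else:
            chunks.append(("O", [word]))  # token outside any chunk
    return chunks

# Hypothetical tags for illustration only
triples = [("แมว", "NCMN", "B-NP"), ("กิน", "VACT", "B-VP"), ("ปลา", "NCMN", "I-VP")]
print(iob_to_chunks(triples))  # [('NP', ['แมว']), ('VP', ['กิน', 'ปลา'])]
```

conlltags2tree performs essentially this grouping, but builds an nltk.Tree that svgling can render.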
[3]:
svgling.draw_tree(conlltags2tree(test("แมวกินปลา")))
[3]:
../_images/notebooks_pythainlp_chunk_4_0.svg
[4]:
svgling.draw_tree(conlltags2tree(test("คนหนองคายเป็นคนน่ารัก")))
[4]:
../_images/notebooks_pythainlp_chunk_5_0.svg
[5]:
svgling.draw_tree(conlltags2tree(test("ปลาอะไรอยู่ในน้ำ")))
[5]:
../_images/notebooks_pythainlp_chunk_6_0.svg
[6]:
svgling.draw_tree(conlltags2tree(test("ในน้ำมีอะไรอยู่")))
[6]:
../_images/notebooks_pythainlp_chunk_7_0.svg
[7]:
svgling.draw_tree(conlltags2tree(test("ทำไมเขารักคุณ")))
[7]:
../_images/notebooks_pythainlp_chunk_8_0.svg
[8]:
svgling.draw_tree(conlltags2tree(test("คนอะไรอยู่หลังต้นไม้")))
[8]:
../_images/notebooks_pythainlp_chunk_9_0.svg