
Thai Chunk Parser

This tutorial demonstrates how to use the chunk_parse function from the PyThaiNLP library to parse Thai text into phrases. We will use a chunking model trained on the ORCHID++ corpus.


We will need the following libraries and packages:

- PyThaiNLP
- NLTK (to preprocess chunk data for visualization)
- svgling (for visualization)
- python-crfsuite

!pip install pythainlp svgling nltk python-crfsuite
Requirement already satisfied: pythainlp in /usr/local/lib/python3.10/dist-packages (4.0.2)
Requirement already satisfied: svgling in /usr/local/lib/python3.10/dist-packages (0.3.1)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Collecting python-crfsuite
  Downloading python_crfsuite-0.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (993 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 993.5/993.5 kB 8.4 MB/s eta 0:00:00
Requirement already satisfied: requests>=2.22.0 in /usr/local/lib/python3.10/dist-packages (from pythainlp) (2.31.0)
Requirement already satisfied: svgwrite in /usr/local/lib/python3.10/dist-packages (from svgling) (1.4.3)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.6)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.22.0->pythainlp) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.22.0->pythainlp) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.22.0->pythainlp) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.22.0->pythainlp) (2023.7.22)
Installing collected packages: python-crfsuite
Successfully installed python-crfsuite-0.9.9

We need to import the following modules and functions:

- word_tokenize – takes a Thai text and returns a list of tokenized words
- pos_tag – takes a list of tokenized words and marks them with part-of-speech (POS) tags
- chunk_parse – takes words with their POS tags and marks them with inside-outside-beginning (IOB) tags
- conlltags2tree – part of NLTK; converts IOB-tagged data into a tree
- svgling – used to visualize the tree as SVG

from pythainlp.tokenize import word_tokenize
from pythainlp.tag import pos_tag
from pythainlp.tag import chunk_parse
from nltk.chunk import conlltags2tree
import svgling

We define a new function, test, which first segments the input text into words (word_tokenize), tags the words with their parts of speech based on the ORCHID corpus (pos_tag), and then performs chunking (chunk_parse). The function combines the words, POS tags, and IOB tags into a list of triples, p.

def test(txt):
    # Tokenize the text and tag each word with its part of speech
    m = pos_tag(word_tokenize(txt), engine="perceptron", corpus="orchid")
    # Predict an IOB chunk tag for each (word, POS) pair
    tag = chunk_parse(m)
    # Combine words, POS tags, and IOB tags into triples
    p = [(w, t, tag[i]) for i, (w, t) in enumerate(m)]
    return p

Finally, we call the test function to chunk several example sentences. We then use the svgling.draw_tree function to visualize the syntactic trees, which were generated from the chunked data by the conlltags2tree function.