AttaCut: A Fast and Accurate Neural Thai Word Segmenter



TL;DR: A 3-layer dilated CNN on syllable and character features. It is 6x faster than DeepCut (SOTA), while its word-level F1 (WL-f1) on BEST [1] is 91%, only 2% lower than DeepCut's.

Installation

pip install attacut

Note: Windows users should install torch before running the command above. Visit PyTorch.org for further instructions.

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

Usage:
attacut-cli <src> [--dest=<dest>] [--model=<model>]
attacut-cli (-h | --help)

Options:
-h --help         Show this screen.
--model=<model>   Model to be used [default: attacut-sc].
--dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
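
The default destination described in the help text above can be sketched as a small helper. This is only a literal reading of the pattern <src>-tokenized-by-<model>.txt; the function name default_dest is hypothetical, and the real CLI may treat the source file's extension differently.

```python
def default_dest(src: str, model: str = "attacut-sc") -> str:
    """Build the output path used when --dest is not given.

    Assumption: the pattern from the help text is applied literally,
    i.e. <src> is substituted as-is, including any file extension.
    """
    return f"{src}-tokenized-by-{model}.txt"
```

Under this assumption, tokenizing a file named news.txt with the default model would write to news.txt-tokenized-by-attacut-sc.txt.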

High-Level API

from attacut import tokenize, Tokenizer

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer might be instantiated directly,
# allowing one to specify whether to use attacut-sc or attacut-c.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
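
When several lines of text need to be tokenized, it may help to create the tokenizer once and reuse it. The following is a minimal sketch, not part of AttaCut itself: tokenize_lines and load_attacut_tokenizer are hypothetical helper names, the "|" separator is an arbitrary choice, and the sketch assumes Tokenizer.tokenize returns a list of strings, as the example above suggests.

```python
from typing import Callable, Iterable, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize_fn: Callable[[str], List[str]],
    sep: str = "|",
) -> List[str]:
    """Tokenize each non-empty line and join its words with `sep`."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        results.append(sep.join(tokenize_fn(line)))
    return results


def load_attacut_tokenizer(model: str = "attacut-sc") -> Callable[[str], List[str]]:
    """Return a reusable tokenize callable backed by AttaCut.

    The import is done lazily so that tokenize_lines can be used
    (and tested) with any other tokenizer function as well.
    """
    from attacut import Tokenizer

    return Tokenizer(model=model).tokenize
```

A typical call would be tokenize_lines(open("input.txt", encoding="utf-8"), load_attacut_tokenizer()), where input.txt is a placeholder file name. Passing the tokenizer in as a function keeps the model loaded once for the whole batch.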

AttaCut will soon be integrated into PyThaiNLP’s ecosystem. Please see PyThaiNLP #28 for recent updates.

[1] NECTEC. BEST: Benchmark for Enhancing the Standard of Thai language processing, 2010.