AttaCut: A Fast and Accurate Neural Thai Word Segmenter


[Figure: attacut-sych.png]

TL;DR: A 3-layer dilated CNN over syllable and character features. It is about 6x faster than DeepCut (the current SOTA), while its word-level F1 (WL-f1) on BEST [1] is 91%, only 2% lower.
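
To make the TL;DR concrete, below is a minimal PyTorch sketch of that kind of architecture: three stacked dilated 1-D convolutions over concatenated character and syllable embeddings. All names, vocabulary sizes, dilation rates, and layer widths here are illustrative assumptions, not the actual AttaCut hyperparameters.

# Illustrative sketch only -- not the real AttaCut model or its hyperparameters.
import torch
import torch.nn as nn

class DilatedCNNSegmenter(nn.Module):
    def __init__(self, n_chars=180, n_syllables=4000, emb_dim=32, conv_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb_dim)
        self.syl_emb = nn.Embedding(n_syllables, emb_dim)
        # three stacked dilated convolutions widen the receptive field
        # without adding many layers
        self.convs = nn.ModuleList([
            nn.Conv1d(2 * emb_dim if i == 0 else conv_dim,
                      conv_dim, kernel_size=3, dilation=d, padding=d)
            for i, d in enumerate([1, 2, 4])
        ])
        self.out = nn.Linear(conv_dim, 1)  # per-character "start of word" score

    def forward(self, char_ids, syl_ids):
        # char_ids, syl_ids: (batch, seq_len), aligned per character
        x = torch.cat([self.char_emb(char_ids), self.syl_emb(syl_ids)], dim=-1)
        x = x.transpose(1, 2)                      # (batch, channels, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(x))
        logits = self.out(x.transpose(1, 2)).squeeze(-1)  # (batch, seq_len)
        return logits                              # sigmoid > 0.5 => word boundary

# Example: score a dummy batch of one 10-character sequence
model = DilatedCNNSegmenter()
chars = torch.randint(0, 180, (1, 10))
syls = torch.randint(0, 4000, (1, 10))
print(model(chars, syls).shape)  # torch.Size([1, 10])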

Installation

pip install attacut

Note: For Windows users, please install torch before running the command above. Visit PyTorch.org for further instructions.

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

Usage:
attacut-cli <src> [--dest=<dest>] [--model=<model>] [--num-cores=<num-cores>] [--batch-size=<batch-size>] [--gpu]
attacut-cli [-v | --version]
attacut-cli [-h | --help]

Arguments:
<src>             Path to input text file to be tokenized

Options:
-h --help         Show this screen.
--model=<model>   Model to be used [default: attacut-sc].
--dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
-v --version      Show version
--num-cores=<num-cores>  Use multiple-core processing [default: 0]
--batch-size=<batch-size>  Batch size [default: 20]
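
For example, to tokenize a file with the default model (the file name here is a placeholder):

$ attacut-cli input.txt --model=attacut-sc

Since --dest is not given, the output is written next to the input, following the <src>-tokenized-by-<model>.txt pattern shown above.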

High-Level API

from attacut import tokenize, Tokenizer

txt = "ฉันกินข้าวที่บ้าน"  # example Thai input ("I eat rice at home")

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer might be instantiated directly,
# allowing one to specify whether to use attacut-sc or attacut-c.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
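
Both calls return the segmented tokens. Assuming `txt` is defined as above and the result is a list of strings, a quick way to inspect it is to join the tokens with a visible delimiter:

print("|".join(words))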

AttaCut will soon be integrated into PyThaiNLP’s ecosystem. Please see PyThaiNLP #28 for recent updates.

For better efficiency, we recommend using attacut-cli. Please consult our Google Colab tutorial for more details.

[1] NECTEC. BEST: Benchmark for Enhancing the Standard of Thai language processing, 2010.