Retraining

Project Overview

The AttaCut project is organized into several submodules. Four of them are worth knowing if you want to customize AttaCut beyond what we provide:

Models contains definitions of models.
Dataloaders contains the code that prepares data for a particular model.
Preprocessing includes methods for cleaning data.
Utils contains other helper functions.

Training Data Format

To train an AttaCut model, prepare the data as follows:

For AttaCut-SC

Character Dictionary: mapping from a character to an index
Syllable Dictionary: mapping from a syllable to an index
Training and Validation sets: each line of these sets has to be in the format below:
1000100101::CH_IX CH_IX CH_IX ...::SY_IX SY_IX ...

Field	Description
1000100101	sequence of labels, one per character. 1 marks the first character of a word; 0 marks any other character.
CH_IX …	sequence of character indices
SY_IX …	sequence of syllable indices. Characters in the same syllable have the same SY_IX

Each line of a training or validation file can correspond to one line of your original text.
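The format above can be sketched in Python as follows. This is a minimal illustration, not AttaCut's own preprocessing code: the function name `encode_line`, the fallback indices `unk_ch` and `unk_sy`, and the toy dictionaries are all hypothetical, and the sketch assumes you already have the text segmented into words and into syllables by some other tool.

```python
def encode_line(words, syllables, char2ix, syl2ix, unk_ch=0, unk_sy=0):
    """Encode one line as 'LABELS::CH_IX ...::SY_IX ...'.

    words / syllables: two segmentations of the same text.
    unk_ch / unk_sy: hypothetical fallback indices for unseen entries.
    """
    words = [w for w in words if w]
    syllables = [s for s in syllables if s]
    text = "".join(words)
    if "".join(syllables) != text:
        raise ValueError("words and syllables must spell the same text")

    # 1 for the first character of each word, 0 for every other character.
    labels = "".join("1" + "0" * (len(w) - 1) for w in words)

    # One character index per character.
    ch_ix = [str(char2ix.get(ch, unk_ch)) for ch in text]

    # One syllable index per character; characters of the same syllable
    # repeat that syllable's index.
    sy_ix = []
    for syl in syllables:
        ix = str(syl2ix.get(syl, unk_sy))
        sy_ix.extend([ix] * len(syl))

    return "::".join([labels, " ".join(ch_ix), " ".join(sy_ix)])


# Toy example with made-up dictionaries.
char2ix = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
syl2ix = {"ab": 10, "c": 11, "de": 12}
print(encode_line(["ab", "cde"], ["ab", "c", "de"], char2ix, syl2ix))
# prints: 10100::1 2 3 4 5::10 10 11 12 12
```

How unknown characters and syllables should be indexed depends on how your dictionaries were built, so adjust the fallback values accordingly.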

With these ingredients, one has to create a directory:

$ ls ./some-dataset
characters.json
syllables.json
training.txt
val.txt
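Before starting a training run, it may help to sanity-check such a directory. The sketch below is an assumption-laden helper, not part of AttaCut: the names `missing_files` and `check_line` are hypothetical, and the check assumes every line carries exactly one character index and one syllable index per label, as the format description suggests.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_FILES = ("characters.json", "syllables.json", "training.txt", "val.txt")


def missing_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def check_line(line):
    """Return True if line looks like 'LABELS::CH_IX ...::SY_IX ...'."""
    parts = line.rstrip("\n").split("::")
    if len(parts) != 3:
        return False
    labels, ch_ix, sy_ix = parts
    # Labels must be a non-empty string of 0s and 1s.
    if not labels or set(labels) - {"0", "1"}:
        return False
    # Assumed invariant: one character index and one syllable index per label.
    n = len(labels)
    return len(ch_ix.split()) == n and len(sy_ix.split()) == n


def bad_line_numbers(path):
    """Return the 1-based numbers of lines in path that fail check_line."""
    with open(path, encoding="utf-8") as f:
        return [i for i, line in enumerate(f, start=1) if not check_line(line)]
```

For example, `missing_files("./some-dataset")` returns an empty list when all four files are present, and `bad_line_numbers("./some-dataset/training.txt")` lists the lines worth inspecting.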

Our AttaCut-SC training data can be found here:

https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

For AttaCut-C

The preparation is the same as for AttaCut-SC, except that we do not need the syllable dictionary or the syllable indices (SY_IX SY_IX …).
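If you already have data in the AttaCut-SC format, one way to derive the character-only variant is to drop the syllable field. The sketch below assumes that an AttaCut-C line is simply `LABELS::CH_IX CH_IX ...`; this is inferred from the description above rather than confirmed against the training code, and the function name `sc_line_to_c_line` is hypothetical.

```python
def sc_line_to_c_line(line):
    """Drop the syllable indices from an AttaCut-SC formatted line."""
    parts = line.rstrip("\n").split("::")
    if len(parts) != 3:
        raise ValueError("expected 'LABELS::CH_IX ...::SY_IX ...'")
    labels, ch_ix, _sy_ix = parts
    # Assumed AttaCut-C layout: labels and character indices only.
    return f"{labels}::{ch_ix}"


print(sc_line_to_c_line("10100::1 2 3 4 5::10 10 11 12 12"))
# prints: 10100::1 2 3 4 5
```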

Our AttaCut-C training data can be found here:

https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

How to Retrain on a Custom Dataset?

Our training script is provided in ./scripts/train.py. Several options can be specified when calling the script.

This is an example of how we use it:

# seq_sy_ch_conv_concat is the model name for AttaCut-SC.
# In --model-params, embc and embs are the character and syllable embedding dimensions.
$ python ./scripts/train.py --model-name seq_sy_ch_conv_concat \
    --model-params "embc:8|embs:8|conv:8|l1:6|do:0.1" \
    --data-dir ./some-dataset \
    --output-dir ./sink/model-xx \
    --epoch 10 \
    --batch-size 1024 \
    --lr 0.001 \
    --lr-schedule "step:5|gamma:0.5"
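The `--model-params` and `--lr-schedule` values are `key:value` pairs joined by `|`. The sketch below shows one way to read such strings and what a step schedule with these values would produce. It is only an illustration: `parse_params` and `step_lr` are hypothetical names, not functions from AttaCut, and the interpretation of `step` and `gamma` as "multiply the learning rate by gamma every step epochs" is an assumption based on the usual meaning of a step schedule.

```python
def parse_params(spec):
    """Parse 'key:value|key:value' into a dict with int or float values."""
    params = {}
    for item in spec.split("|"):
        key, value = item.split(":", 1)
        params[key] = float(value) if "." in value else int(value)
    return params


def step_lr(base_lr, epoch, step, gamma):
    """Learning rate at a 0-based epoch under the assumed step decay."""
    return base_lr * gamma ** (epoch // step)


model_params = parse_params("embc:8|embs:8|conv:8|l1:6|do:0.1")
schedule = parse_params("step:5|gamma:0.5")

# Under this assumption, 10 epochs at --lr 0.001 would use 0.001 for
# epochs 0-4 and half of that for epochs 5-9.
for epoch in range(10):
    print(epoch, step_lr(0.001, epoch, schedule["step"], schedule["gamma"]))
```

Check the option parsing in ./scripts/train.py before relying on this interpretation.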

AttaCut’s training code was built primarily to run on FloydHub. The training jobs for our released models are:

AttaCut-SC: https://www.floydhub.com/pattt/projects/attacut/50
AttaCut-C: https://www.floydhub.com/pattt/projects/attacut/42

Please let us know if you have any further questions.

Happy coding and less overfitting! 🤪