Retraining
----------

Project Overview
^^^^^^^^^^^^^^^^

The AttaCut project is structured into several submodules. Four of them are important and worth knowing for further customization beyond what we have provided:

1. **Models** contains model definitions.
2. **Dataloaders** contains functionality for processing data for a particular model.
3. **Preprocessing** includes methods for cleaning data.
4. **Utils** contains other helper functions.

Training Data Format
^^^^^^^^^^^^^^^^^^^^

To train an AttaCut model, one needs to prepare the data as follows:

For AttaCut-SC
""""""""""""""

1. Character dictionary: a mapping from a character to an index.
2. Syllable dictionary: a mapping from a syllable to an index.
3. Training and validation sets, in which every line has the format below:

.. code-block:: bash

    1000100101::CH_IX CH_IX CH_IX ...::SY_IX SY_IX ...

+-----------------+-----------------------------------------------------------------------------------------+
| **Field**       | **Description**                                                                         |
+-----------------+-----------------------------------------------------------------------------------------+
| 1000100101      | sequence of labels; ``1`` indicates a word-starting character                           |
+-----------------+-----------------------------------------------------------------------------------------+
| CH_IX ...       | sequence of character indices                                                           |
+-----------------+-----------------------------------------------------------------------------------------+
| SY_IX ...       | sequence of syllable indices; characters in the same syllable share the same **SY_IX**  |
+-----------------+-----------------------------------------------------------------------------------------+

Each line corresponds to a line in your original text (a short, illustrative sketch for generating such lines is given at the end of this page).

With these ingredients, one has to create a directory containing:

.. code-block:: bash

    $ ls ./some-dataset
    characters.json  syllables.json  training.txt  val.txt

Our AttaCut-SC training data can be found here:

- https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

For AttaCut-C
"""""""""""""

Every detail is the same as the preparation for AttaCut-SC, except that we do not need the syllable dictionary and the syllable indices **(SY_IX SY_IX ...)**.

Our AttaCut-C training data can be found here:

- https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

How to Retrain on a Custom Dataset?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Our training script is provided in ``./scripts/train.py``. Several options can be specified when calling the script. This is an example of how we use it:

.. code-block:: bash

    # seq_sy_ch_conv_concat is the model behind AttaCut-SC;
    # emb{c,s} are the character and syllable embedding dimensions
    $ python ./scripts/train.py --model-name seq_sy_ch_conv_concat \
        --model-params "embc:8|embs:8|conv:8|l1:6|do:0.1" \
        --data-dir ./some-data \
        --output-dir ./sink/model-xx \
        --epoch 10 \
        --batch-size 1024 \
        --lr 0.001 \
        --lr-schedule "step:5|gamma:0.5"

AttaCut's training code is primarily built to be used on `FloydHub <https://www.floydhub.com>`_. Our training jobs for the released models are:

- AttaCut-SC: https://www.floydhub.com/pattt/projects/attacut/50
- AttaCut-C: https://www.floydhub.com/pattt/projects/attacut/42

Please let us know if you have any further questions. Happy coding and less overfitting! 🤪
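As a final pointer, the snippet below sketches one way to turn a pre-segmented sentence into the AttaCut-SC line format described above (``LABELS::CH_IX ...::SY_IX ...``). The ``encode_line`` helper, the ``|``/``~`` word and syllable separators, and the unknown-index fallback are illustrative assumptions, not part of the AttaCut codebase; adapt them to however your corpus is annotated.

.. code-block:: python

    # Minimal sketch (not part of AttaCut): build one AttaCut-SC training line
    # from a sentence segmented into words ("|") and syllables ("~").
    import json
    from typing import Dict, List


    def encode_line(segmented: str,
                    ch2ix: Dict[str, int],
                    sy2ix: Dict[str, int],
                    unk_ix: int = 0) -> str:
        """Encode one segmented sentence as LABELS::CH_IX CH_IX ...::SY_IX SY_IX ..."""
        labels: List[str] = []
        ch_ixs: List[str] = []
        sy_ixs: List[str] = []

        for word in segmented.split("|"):
            first_char_of_word = True
            for syllable in word.split("~"):
                sy_ix = sy2ix.get(syllable, unk_ix)
                for ch in syllable:
                    # "1" marks a word-starting character, "0" everything else
                    labels.append("1" if first_char_of_word else "0")
                    ch_ixs.append(str(ch2ix.get(ch, unk_ix)))
                    # characters in the same syllable share the same SY_IX
                    sy_ixs.append(str(sy_ix))
                    first_char_of_word = False

        return "{}::{}::{}".format("".join(labels), " ".join(ch_ixs), " ".join(sy_ixs))


    if __name__ == "__main__":
        with open("./some-dataset/characters.json", encoding="utf-8") as f:
            ch2ix = json.load(f)
        with open("./some-dataset/syllables.json", encoding="utf-8") as f:
            sy2ix = json.load(f)

        # "ตัดคำภาษาไทย", pre-segmented into words ("|") and syllables ("~")
        print(encode_line("ตัด~คำ|ภา~ษา~ไทย", ch2ix, sy2ix))

For AttaCut-C, one would simply drop the third (``SY_IX``) field.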