CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models Using Synthetic Back-Translation Data

Kung Yin Hong, Lifeng Han, Riza Batista-Navarro, Goran Nenadic

TL;DR

This work applies back-translation, a standard data augmentation method, to a new translation direction, Cantonese-to-English. It also provides a user-friendly interface for the models included in the CantonMT project and makes it available to facilitate Cantonese-to-English MT research.

Abstract

Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply back-translation, a standard data augmentation method, to a new translation direction, Cantonese-to-English. We present models, including OpusMT, NLLB, and mBART, that we fine-tuned using the limited amount of real data together with synthetic data generated by back-translation. We carried out automatic evaluation using a range of metrics, both lexical-based and embedding-based. Furthermore, we created a user-friendly interface for the models included in the CantonMT research project and made it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit (https://github.com/kenrickkung/CantoneseTranslation).
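
The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not taken from the CantonMT toolkit: the function names, the `translate` callable (a stand-in for any fine-tuned model), and the `monolingual_is_source` flag are all hypothetical, and the direction in which the monolingual text is translated is left as a parameter because the abstract does not specify it.

```python
from typing import Callable, List, Tuple

# A training pair is (Cantonese source, English target).
Pair = Tuple[str, str]


def make_synthetic_pairs(
    monolingual: List[str],
    translate: Callable[[str], str],  # hypothetical stand-in for an MT model
    monolingual_is_source: bool,
) -> List[Pair]:
    """Pair each monolingual sentence with its machine translation."""
    pairs: List[Pair] = []
    for sentence in monolingual:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty lines
        translated = translate(sentence).strip()
        if not translated:
            continue  # drop sentences the model could not translate
        if monolingual_is_source:
            pairs.append((sentence, translated))
        else:
            pairs.append((translated, sentence))
    return pairs


def build_training_set(real: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Concatenate real and synthetic pairs, dropping exact duplicates in order."""
    seen = set()
    merged: List[Pair] = []
    for pair in real + synthetic:
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

In practice `translate` would wrap a model call; the merged list would then be used for a further round of fine-tuning. Real pipelines usually add quality filtering of the synthetic pairs, which this sketch omits.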

Paper Structure (11 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: CantonMT Pipeline: data collection and preprocessing, synthetic data generation, model fine-tuning, model evaluation
  • Figure 2: Learning curves during model training using real data.
  • Figure 3: CantonMT Server and Interface Flowchart diagram.
  • Figure 4: CantonMT Platform with options for model types, training categories, and translation directions. Frontend: TypeScript with Next.js. Backend: Python with Flask.
  • Figure 5: Example text extracted from the LIHKG website, with a lot of noise, before cleaning and anonymisation.