CANTONMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation
Kung Yin Hong, Lifeng Han, Riza Batista-Navarro, Goran Nenadic
TL;DR
This work targets Cantonese→English neural MT under low-resource conditions by constructing a new parallel corpus (≈68K sentence pairs) and a large Cantonese monolingual corpus from LIHKG, then evaluating baseline Transformer-based models (Opus-MT, mBART, NLLB) with back-translation and a model-switch mechanism. The best approach, NLLB-mBART with model-switch, achieves competitive automatic metrics compared to Bing and Baidu while enabling an open-source CantonMT toolkit and web interface for broader use. Automatic and human evaluations reveal both the promise and remaining gaps, including text degeneration and terminology handling, underscoring the value of data quality, model collaboration, and targeted improvements for practical Cantonese MT. Overall, the study contributes valuable data resources, methodological advances in back-translation and model-switching, and an accessible platform to spur further Cantonese MT research and deployment.
Abstract
This paper investigates the development and evaluation of machine translation models from Cantonese to English, and proposes a novel approach to low-resource language translation. The main objectives are to develop a model that translates Cantonese to English effectively and to evaluate it against state-of-the-art commercial models. To this end, a new parallel corpus was created by combining different corpora available online, followed by preprocessing and cleaning. In addition, a monolingual Cantonese dataset was created through web scraping to support synthetic parallel corpus generation. Following data collection, several approaches were applied, including model fine-tuning, back-translation, and model switch. Translation quality was evaluated with multiple metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTScore). Based on the automatic metrics, the best model was selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART) with the model-switch mechanism achieved automatic evaluation scores comparable to, and even better than, those of state-of-the-art commercial models (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed that allows users to translate between Cantonese and English and to compare the different models trained in this investigation. CANTONMT is available at https://github.com/kenrickkung/CantoneseTranslation
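As a rough illustration of the synthetic parallel corpus generation mentioned above, the following minimal sketch pairs monolingual sentences with machine-generated translations and mixes them with real parallel data. It is not the authors' implementation: the function names, the `Translator` type, and the stub model are all hypothetical, and reading "model switch" as "one model generates the synthetic side, a different model is fine-tuned on the result" is an assumption, since the abstract does not define the mechanism.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: any function that maps one sentence to its translation.
Translator = Callable[[str], str]


def build_synthetic_pairs(
    monolingual: Iterable[str], translate: Translator
) -> List[Tuple[str, str]]:
    """Pair each non-empty monolingual sentence with a machine translation."""
    pairs = []
    for sentence in monolingual:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines left over from scraping
        translation = translate(sentence).strip()
        if translation:  # drop sentences the generator could not translate
            pairs.append((sentence, translation))
    return pairs


def mix_corpora(
    real_pairs: Iterable[Tuple[str, str]],
    synthetic_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Concatenate real and synthetic pairs into one training set."""
    return list(real_pairs) + list(synthetic_pairs)


def stub_generator(sentence: str) -> str:
    """Stand-in for a real translation model (assumption, for illustration)."""
    return f"<en:{sentence}>"


monolingual_cantonese = ["你好", "  ", "多謝晒"]
real = [("早晨", "Good morning")]

# Under the assumed reading of model switch, `stub_generator` would be one
# model, and a *different* model would then be fine-tuned on `training_data`.
synthetic = build_synthetic_pairs(monolingual_cantonese, stub_generator)
training_data = mix_corpora(real, synthetic)
```

In practice the stub would be replaced by a call into a fine-tuned translation model, and the synthetic pairs would typically be filtered for quality before mixing.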
