An Efficient Approach for Machine Translation on Low-resource Languages: A Case Study in Vietnamese-Chinese

Tran Ngoc Son, Nguyen Anh Tu, Nguyen Minh Tri

TL;DR

The paper tackles machine translation (MT) for a low-resource language pair, Vietnamese-Chinese, by leveraging the multilingual pre-trained model mBART-50 and domain-focused data augmentation. It introduces a three-step pipeline: fine-tune on existing bilingual data, use TF-IDF to select domain-relevant monolingual sentences, and generate synthetic parallel data from those sentences to augment training. The approach yields notable BLEU gains, including a $3.19$-point improvement in Chinese-to-Vietnamese translation, and performs competitively against established translation engines in certain directions. This work highlights the value of targeted monolingual data and synthetic augmentation for enhancing MT in low-resource settings and provides a practical framework for similar language pairs.

Abstract

Despite the recent rise of neural networks in machine translation, these networks do not work well when the training data is insufficient. In this paper, we propose an approach for machine translation between low-resource language pairs such as Vietnamese-Chinese. Our method leverages a multilingual pre-trained language model (mBART) together with Vietnamese and Chinese monolingual corpora. First, we build an initial ("early bird") machine translation model using the bilingual training dataset. Second, we use the TF-IDF technique to select the sentences from the monolingual corpora that are most related to the domains of the parallel dataset. Finally, the first model is used to synthesize augmented training data from the selected monolingual sentences for the translation model. Our proposed scheme outperformed the transformer model by 8%. The augmented dataset also further improved model performance.
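The TF-IDF selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tokenize` and `select_in_domain`, the whitespace tokenizer, the smoothed IDF formula, and the choice to treat one side of the parallel data as a single "domain document" are all assumptions made for illustration. Real Chinese text would need a word segmenter in place of `split()`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese would need a word segmenter.
    return text.lower().split()


def select_in_domain(parallel_side, monolingual, top_k):
    """Return the top_k monolingual sentences most similar to the parallel data.

    Similarity is the cosine between TF-IDF vectors. The in-domain side of the
    parallel corpus is merged into one document (an illustrative assumption).
    """
    mono_tokens = [tokenize(s) for s in monolingual]
    domain_tokens = [t for s in parallel_side for t in tokenize(s)]

    # Document frequencies over the monolingual sentences plus the domain document.
    docs = mono_tokens + [domain_tokens]
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF so that no term receives a zero or negative weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        # L2-normalized TF-IDF vector stored as a sparse dict.
        tf = Counter(tokens)
        vec = {t: f * idf[t] for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {t: v / norm for t, v in vec.items()} if norm else {}

    domain_vec = vectorize(domain_tokens)

    scored = []
    for sent, toks in zip(monolingual, mono_tokens):
        vec = vectorize(toks)
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(w * domain_vec.get(t, 0.0) for t, w in vec.items())
        scored.append((score, sent))

    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:top_k]]


if __name__ == "__main__":
    in_domain = ["the patient received a vaccine", "the doctor examined the patient"]
    pool = [
        "the football match ended late",
        "a doctor gave the patient a vaccine",
        "stock prices fell sharply",
    ]
    print(select_in_domain(in_domain, pool, top_k=1))
```

In the pipeline described above, the selected sentences would then be translated by the initial model to form synthetic sentence pairs, which are added to the bilingual training set.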

Paper Structure

This paper contains 12 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Flow of data processing and model training