High-Resource Translation: Turning Abundance into Accessibility

Abhiram Reddy Yanampally

TL;DR

The paper investigates English-to-Telugu machine translation in low-resource contexts by combining transfer learning, backtranslation, and iterative fine-tuning. It leverages the BPCC corpus and SentencePiece subword tokenization, fine-tuning a pre-trained Opus-MT model with FlashAttention to improve efficiency. By generating synthetic data through backtranslation and conducting five iterative training cycles, the approach achieves a test BLEU of 11.69 (train BLEU 14.85) on a 400-sentence Telugu test set after roughly 37 hours on an 8 GB GPU, illustrating both the promise and limitations of applying current MT techniques to Telugu. The work outlines practical pathways to improve accessibility for Telugu speakers and suggests future enhancements in model architectures, hyperparameter tuning, and corpus expansion to boost translation quality.

Abstract

This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques and addressing the challenges associated with low-resource languages. Utilizing the Bharat Parallel Corpus Collection (BPCC) as the primary dataset, the model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities. The research focuses on a comprehensive strategy for improving model performance through data augmentation, optimization of training parameters, and the effective use of pre-trained models. These methodologies aim to create a robust translation system that can handle diverse sentence structures and linguistic nuances in both English and Telugu. This work highlights the significance of innovative data handling techniques and the potential of transfer learning in overcoming limitations posed by sparse datasets in low-resource languages. The study contributes to the field of machine translation and seeks to improve communication between English and Telugu speakers in practical contexts.
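
The iterative backtranslation loop described in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation: the names `back_translate`, `iterative_backtranslation`, and `toy_train` are hypothetical, the "trainer" is a lookup-table stand-in for fine-tuning a real model such as Opus-MT, and the choice to regenerate synthetic pairs in every cycle is an assumption. Only the default of five cycles comes from the summary above.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                    # (source sentence, target sentence)
Translator = Callable[[str], str]
Trainer = Callable[[List[Pair]], Translator]


def back_translate(mono_target: List[str], reverse: Translator) -> List[Pair]:
    """Pair each human-written target sentence with a machine-made source."""
    return [(reverse(t), t) for t in mono_target]


def iterative_backtranslation(
    parallel: List[Pair],
    mono_target: List[str],
    train: Trainer,
    cycles: int = 5,
) -> Tuple[Translator, List[Pair]]:
    """Return the forward model and the last augmented training set."""
    forward = train(parallel)
    augmented = list(parallel)
    for _ in range(cycles):
        # The reverse model (target -> source) learns from the flipped data.
        reverse = train([(tgt, src) for src, tgt in augmented])
        synthetic = back_translate(mono_target, reverse)
        # Real pairs are always kept; synthetic pairs are rebuilt each cycle
        # (an assumption of this sketch).
        augmented = list(parallel) + synthetic
        forward = train(augmented)
    return forward, augmented


def toy_train(pairs: List[Pair]) -> Translator:
    """Stand-in for fine-tuning: memorises pairs and echoes unknown input."""
    table = dict(pairs)
    return lambda sentence: table.get(sentence, sentence)
```

In a real pipeline, `train` would wrap a fine-tuning run and `reverse` would be a Telugu-to-English model. The key property shown here is that the target side of every synthetic pair is genuine Telugu text, so the forward model always learns to produce human-written output even when its input is machine-generated.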

Paper Structure

This paper contains 24 sections, 7 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Overview of the Backtranslation and Iterative Fine-Tuning Process
  • Figure 2: Sample Translations from English to Telugu