A Tulu Resource for Machine Translation
Manu Narayanan, Noëmi Aepli
TL;DR
This paper introduces the first parallel dataset for English–Tulu (EN–TCY) translation, built by extending FLORES-200 with human translations into Tulu. It also uses transfer learning from Kannada to train an MT system without any EN–TCY parallel data. Using IndicBARTSS and the YANMTT toolkit in an NMT-Adapt-inspired pipeline (back-translation and denoising autoencoding), the authors report progressive BLEU gains in both translation directions. The EN→TCY direction improves substantially and surpasses Google Translate as evaluated in September 2023. The work shows both the feasibility and the challenges of MT for a low-resource Dravidian language, emphasizes community-engaged resource creation with Jai Tulunad, and discusses limitations and future directions for improving the model and expanding the dataset. The dataset and code are released to support further work on low-resource MT and Tulu language technology, with implications for language preservation and digital accessibility.
Abstract
We present the first parallel dataset for English-Tulu translation. Tulu belongs to the South Dravidian branch of the Dravidian language family and is spoken by approximately 2.5 million people, predominantly in southwestern India. We construct the dataset by adding human translations to the multilingual machine translation resource FLORES-200. We also use this dataset to evaluate our English-Tulu machine translation model. To train the model, we leverage resources available for related South Dravidian languages. We adopt a transfer learning approach that exploits similarities between high-resource and low-resource languages. This approach makes it possible to train a machine translation system even when no parallel data exists between the source and target languages, thereby overcoming a significant obstacle to machine translation development for low-resource languages. Our English-Tulu system, trained without any parallel English-Tulu data, outperforms Google Translate by 19 BLEU points (as of September 2023). The dataset and code are available at https://github.com/manunarayanan/Tulu-NMT.
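To make the BLEU comparison concrete, the following is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, combined with a brevity penalty). It is not the scoring setup used for the reported results, which is not specified in this section. Whitespace tokenization, a single reference per sentence, the absence of smoothing, and the function names `ngram_counts` and `corpus_bleu` are all assumptions made for illustration. The example sentences are English placeholders, not data from the Tulu dataset.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (illustrative only): whitespace tokenization,
    one reference per hypothesis, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = 0
    ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each match to the reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    system_a = ["the cat sat on the mat today"]  # close to the reference
    system_b = ["a dog ran in the park"]         # far from the reference
    print(f"system A: {corpus_bleu(system_a, refs):.1f}")
    print(f"system B: {corpus_bleu(system_b, refs):.1f}")
```

A difference in "BLEU points" between two systems is the difference between two such scores computed on the same test set. Published comparisons normally rely on a standardized implementation with fixed tokenization rather than a hand-rolled function like this one, since tokenization choices alone can shift the score.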
