Challenges in Translating Technical Lectures: Insights from the NPTEL
Basudha Raje, Sadanand Venkatraman, Nandana TP, Soumyadeepa Das, Polkam Poojitha, M. Vijaykumar, Tanima Bagchi, Hema A. Murthy
TL;DR
This paper addresses translating spontaneous, domain-specific academic lectures into Indian languages within the NPTEL ecosystem under NEP 2020. It advances a data-centric MT workflow that combines MT outputs from BhashaVerse and SpringLab with human post-editing to produce gold-standard parallel corpora, accompanied by explainable metadata. The study reveals that traditional metrics like BLEU and METEOR poorly capture semantic adequacy for morphologically rich Indian languages, while COMET better reflects meaning preservation, underscoring the need for domain-aware corpora and evaluation frameworks. It argues for dialect-rich data and dialect-aware modeling to improve accessibility in multilingual higher education, contributing to policy-relevant multilingual pedagogy and resources for language-technology research. The work provides a practical bridge between policy goals, platform-scale multilingual data, and corpus-driven MT optimization, with implications for rural access and equitable education in India.
Abstract
This study examines the practical applications and methodological implications of Machine Translation in Indian languages, specifically Bangla, Malayalam, and Telugu, within emerging translation workflows and in relation to existing evaluation frameworks. The choice of languages is motivated by a triangulation of linguistic diversity, which illustrates the significance of multilingual accommodation in educational technology under NEP 2020. This is further supported by NPTEL, the largest MOOC portal, which has served as the corpus for the arguments presented in this paper. Curating a spontaneous speech corpus that accounts for lucid delivery of technical concepts, while retaining suitable register and lexical choices, is crucial in a country as diverse as India. The findings of this study highlight metric-specific sensitivity and the challenges that morphologically rich and semantically compact features pose when tested against surface-overlap metrics.
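The sensitivity of surface-overlap metrics to morphology can be illustrated with a minimal sketch. It is not the evaluation pipeline used in this study, and it does not compute full BLEU, METEOR, or COMET; it only computes clipped n-gram precision, the building block of BLEU-style scores, once over whitespace tokens and once over characters. The two strings are invented placeholder tokens (not real Bangla, Malayalam, or Telugu data) chosen so that the hypothesis splits or changes suffixes that the reference attaches to the stem, and the function name `ngram_precision` is hypothetical.

```python
from collections import Counter


def ngram_precision(hyp_units, ref_units, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = Counter(tuple(hyp_units[i:i + n]) for i in range(len(hyp_units) - n + 1))
    ref = Counter(tuple(ref_units[i:i + n]) for i in range(len(ref_units) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Invented placeholder strings (assumption: NOT real language data).
# The reference attaches suffixes to stems ("talo", "mirenu"); the hypothesis
# writes one suffix as a separate token and drops the other.
reference = "talo mirenu pask"
hypothesis = "ta lo mire pask"

# Word-level unigram precision: only exact token matches count.
word_p = ngram_precision(hypothesis.split(), reference.split(), 1)

# Character-level trigram precision: shared stems still earn partial credit.
char_p = ngram_precision(list(hypothesis), list(reference), 3)

print(f"word unigram precision:   {word_p:.2f}")
print(f"character trigram precision: {char_p:.2f}")
```

In this toy case only one of four hypothesis tokens matches at the word level, while most character trigrams are shared, which is one reason exact-match word overlap can understate adequacy for agglutinative forms. Neither variant measures meaning preservation, which is what learned metrics such as COMET aim to capture.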
