TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation

Bhavana Akkiraju, Srihari Bandarupalli, Swathi Sambangi, Vasavi Ravuri, R Vijaya Saraswathi, Anil Kumar Vuppala

TL;DR

This paper introduces a 46-hour Telugu–English speech translation benchmark derived from the CSTD corpus and systematically compares cascaded and end-to-end architectures in a low-resource setting. It demonstrates that cascaded systems with extensive Telugu-specific training achieve the highest translation quality, while end-to-end models like SeamlessM4T can be competitively fine-tuned with relatively modest data. The study also evaluates six automatic metrics against human judgments, finding ROUGE-L and ChrF++ to be the most reliable discriminators for morphologically rich translations, and provides practical guidance for evaluation. All resources, including data splits and code, are released to support reproducible research in Indic ST and related low-resource languages.

Abstract

Although Telugu is spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu–English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded and end-to-end architectures shows that IndicWhisper + IndicMT achieves the highest performance, owing to its extensive Telugu-specific training data, while fine-tuned SeamlessM4T models remain competitive despite using significantly less Telugu-specific training data. This finding suggests that, with careful hyperparameter tuning and sufficient parallel data (potentially fewer than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study, which evaluates BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments, reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu–English translation. The work delivers three key contributions: a reproducible Telugu–English benchmark, empirical evidence that end-to-end systems can be competitive in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.

Paper Structure

This paper contains 27 sections, 7 equations, 3 tables.