Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages

Swati Sharma, Divya V. Sharma, Anubha Gupta

TL;DR

The proposed Task-Lens is a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks, and reveals that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks.

Abstract

The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.

Paper Structure (11 sections, 1 equation, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Task-Lens. The pipeline comprises dataset discovery, dataset filtering, and feature extraction, followed by utility mapping, which aligns dataset features with task needs via a task-feature relevance matrix whose entries are labeled Required, Optional, or Not Applicable. A dataset is 'Task-Ready' for a task if it satisfies all 'Required' features for that task. Supported tasks include Automatic Speech Recognition (Monolingual) [T1], Automatic Speech Recognition (Multilingual) [T2], Language Identification [T3], Speaker Verification/Identification [T4], Audio Deepfake Detection [T5], Speech Emotion Recognition [T6], Text-to-Speech (Monolingual) [T7], Text-to-Speech (Multilingual) [T8], and Gender Recognition [T9].
  • Figure 2: Distribution of total dataset duration (in hours) for each task, shown for direct comparison. There is an urgent need for datasets for tasks $T_4$ (SV/SID), $T_5$ (ADD), and $T_6$ (SER).
  • Figure 3: Total speech duration for each Indian language ($L_{1}$–$L_{26}$) across all 50 datasets. Languages $L_{8}$ (Hindi) and $L_{9}$ (Indian English), with 3,981 and 16,154 hours of data respectively, were excluded from the figure because their durations would have dominated the visualization and obscured relative differences among the other languages. Among the languages shown, those such as $L_{2}$, $L_{24}$, and $L_{25}$ have the highest durations, whereas languages such as $L_{23}$, $L_{3}$, $L_{11}$, and $L_{12}$ are virtually absent.
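
The 'Task-Ready' rule in the Figure 1 caption (a dataset qualifies for a task when it satisfies every 'Required' feature for that task) can be illustrated with a minimal sketch. The feature names, task keys, and matrix entries below are hypothetical placeholders chosen for illustration; they are not the paper's actual task-feature relevance matrix.

```python
from typing import Dict, List, Set

# Hypothetical task-feature relevance matrix (illustrative values only).
# "R" = Required, "O" = Optional, "NA" = Not Applicable.
RELEVANCE: Dict[str, Dict[str, str]] = {
    "T1_ASR_mono": {"transcript": "R", "speaker_id": "O", "emotion_label": "NA"},
    "T4_SV_SID": {"transcript": "O", "speaker_id": "R", "emotion_label": "NA"},
    "T6_SER": {"transcript": "O", "speaker_id": "O", "emotion_label": "R"},
}


def required_features(task: str) -> Set[str]:
    """Return the features marked Required for the given task."""
    return {feat for feat, level in RELEVANCE[task].items() if level == "R"}


def is_task_ready(dataset_features: Set[str], task: str) -> bool:
    """A dataset is Task-Ready if it provides every Required feature."""
    return required_features(task) <= dataset_features


def ready_tasks(dataset_features: Set[str]) -> List[str]:
    """List all tasks for which the dataset is Task-Ready, sorted by key."""
    return sorted(t for t in RELEVANCE if is_task_ready(dataset_features, t))


def missing_features(dataset_features: Set[str], task: str) -> Set[str]:
    """Required features the dataset lacks for the given task."""
    return required_features(task) - dataset_features


# Example: a hypothetical dataset with transcripts and speaker IDs only.
example = {"transcript", "speaker_id"}
print(ready_tasks(example))
print(missing_features(example, "T6_SER"))
```

In this sketch, Optional and Not Applicable entries never block readiness; only Required entries are checked. The `missing_features` helper is one possible way to surface which annotations would need to be added before a dataset could serve an additional task.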