Frustratingly Easy Data Augmentation for Low-Resource ASR

Katsumi Ibaraki, David Chiang

TL;DR

Problem: low-resource ASR suffers from scarce labeled data, especially when external lexical resources are unavailable. Approach: a self-contained augmentation pipeline that creates synthetic text via gloss-based replacement, random replacement, or LLM-based generation and then synthesizes audio with a TTS model before fine-tuning Wav2Vec2-XLSR-53. Key contributions: (i) three effective text-generation strategies, (ii) first ASR systems for Vatlongos, Nashta, Shinekhen Buryat, and Kakabe, and (iii) evidence that increasing phonemic and structural variation can outperform semantic coherence in data-scarce settings; English experiments corroborate broad applicability. Significance: the methods are simple, reproducible, and improve low-resource ASR performance, with practical impact for endangered languages and potential generalization to high-resource languages.

Abstract

This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text (using gloss-based replacement, random replacement, or an LLM-based approach) and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
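
The abstract reports an absolute WER reduction, that is, a difference in percentage points rather than a percentage of the baseline. The sketch below shows how word error rate is conventionally computed (word-level edit distance divided by reference length) and how absolute and relative reductions differ. The function name `word_error_rate` and the toy English sentences are illustrative assumptions; they are not the paper's evaluation code or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy transcripts (hypothetical, for illustration only).
reference = "the queen told him"
baseline_wer = word_error_rate(reference, "the queen hold")
augmented_wer = word_error_rate(reference, "the queen told them")

# Absolute reduction is a difference of rates (percentage points);
# relative reduction scales that difference by the baseline rate.
absolute_reduction = baseline_wer - augmented_wer
relative_reduction = absolute_reduction / baseline_wer
```

In this toy case the baseline makes two errors in four reference words and the second system makes one, so the two kinds of reduction give different numbers for the same improvement.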

Paper Structure

This paper contains 14 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Illustration of gloss-based replacement (left) and random replacement (right) applied to the beginning of the Shinekhen Buryat sentence, "Then the first queen told him." The gloss-based method replaces each word with an alternative from the set of all words sharing the same gloss in the training data. In contrast, the random replacement method ignores all linguistic information, substituting each word with a random selection from all words in the training data.
  • Figure 2: LLM prompt for generating synthetic sentences.
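
The two replacement strategies described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy words and glosses are invented placeholders, the function names are hypothetical, and uniform sampling over distinct word types (with the original word allowed as a candidate) is an assumption that the caption does not specify.

```python
import random
from collections import defaultdict

# Toy training data: each sentence is a list of (word, gloss) pairs.
# Words and glosses are invented placeholders, not real language data.
TRAIN = [
    [("wa", "then"), ("ko", "queen"), ("mi", "say")],
    [("wu", "then"), ("ta", "king"), ("mo", "say")],
    [("wa", "then"), ("ke", "queen"), ("ri", "go")],
]


def build_gloss_index(sentences):
    """Map each gloss to the sorted list of distinct words carrying it."""
    index = defaultdict(set)
    for sentence in sentences:
        for word, gloss in sentence:
            index[gloss].add(word)
    return {gloss: sorted(words) for gloss, words in index.items()}


def gloss_replace(sentence, gloss_index, rng):
    """Swap each word for a word that shares its gloss in the training data."""
    return [rng.choice(gloss_index[gloss]) for _, gloss in sentence]


def random_replace(sentence, vocabulary, rng):
    """Swap each word for any training word, ignoring linguistic information."""
    return [rng.choice(vocabulary) for _ in sentence]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
gloss_index = build_gloss_index(TRAIN)
vocabulary = sorted({word for sentence in TRAIN for word, _ in sentence})

source = TRAIN[0]
gloss_sentence = gloss_replace(source, gloss_index, rng)
random_sentence = random_replace(source, vocabulary, rng)
print(" ".join(gloss_sentence))
print(" ".join(random_sentence))
```

Both functions keep the sentence length fixed and differ only in the candidate pool: the gloss-based variant restricts each position to words with the same gloss, while the random variant draws from the whole training vocabulary. In the pipeline described in the abstract, the resulting text would then be passed to a TTS model to produce synthetic audio.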