Revisiting Interpolation Augmentation for Speech-to-Text Generation

Chen Xu, Jie Wang, Xiaoqian Liu, Qianqian Dong, Chunliang Zhang, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang

TL;DR

This work investigates interpolation augmentation (IPA) for speech-to-text (S2T) generation in data-limited settings, proposing two IPA implementations (embedding interpolation in the decoder and encoder-only interpolation) and a robust append-based variant (AIPA). It shows that IPA can improve generalization across ASR and AST tasks and across encoder–decoder and encoder–CTC architectures, particularly when combined with COS-driven training objectives and SpecAugment in a carefully balanced way. AIPA maintains original data distributions to stabilize learning, while COS for CTC offers meaningful reductions in WER; AST gains are achieved by applying learning-objective controls to interpolated samples. Overall, the paper provides practical guidelines for deploying IPA in resource-constrained S2T scenarios and demonstrates its effectiveness across diverse data scales, architectures, and tasks.

Abstract

Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
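The idea of "constructing virtual training samples by interpolating inputs and labels" can be illustrated with a minimal, generic mixup-style sketch. This is not the paper's IPA or AIPA implementation. The function names `interpolate_features` and `mixed_loss`, the zero-padding of the shorter utterance, and drawing the mixing coefficient from a Beta($\alpha$, $\alpha$) distribution are all assumptions made for illustration. Because S2T labels are token sequences rather than single classes, the sketch interpolates the two per-sample losses instead of the labels themselves, which is one common way to realize label interpolation.

```python
import numpy as np


def interpolate_features(x_a, x_b, alpha=2.0, rng=None):
    """Mix two (time, feature) arrays with a Beta(alpha, alpha) coefficient.

    Hypothetical helper: the shorter sequence is zero-padded to the longer
    one, which is an assumption and not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))  # mixing coefficient in [0, 1]

    max_len = max(len(x_a), len(x_b))

    def pad(x):
        # Zero-pad along the time axis so both inputs share one length.
        out = np.zeros((max_len,) + x.shape[1:], dtype=x.dtype)
        out[: len(x)] = x
        return out

    x_mix = lam * pad(x_a) + (1.0 - lam) * pad(x_b)
    return x_mix, lam


def mixed_loss(loss_a, loss_b, lam):
    """Interpolate the losses computed against each original transcript."""
    return lam * loss_a + (1.0 - lam) * loss_b
```

In a training loop, one would compute the model loss for the mixed input once against each of the two original transcripts and combine them with `mixed_loss`, using the same `lam` that produced the mixed input.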

Paper Structure (20 sections, 12 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Visualization of encoder representations of both original (depicted as green squares) and interpolated (depicted as pink circles) samples in the IPA method. The upper triangle and lower triangle represent the centers of two data distributions, respectively. The experiment is conducted using the LibriSpeech 100h dataset with an interpolation ratio of $\gamma = 0.3$. Top: without SpecAugment and $\alpha=2.0$. Middle: with SpecAugment and $\alpha=2.0$. Bottom: with SpecAugment and $\alpha=0.2$.
  • Figure 2: Similar to Figure 1, visualization of encoder representations in the AIPA method. Top: Enc-Dec model with SpecAugment, $\alpha=0.2$. Bottom: Enc-CTC model with SpecAugment, $\alpha=0.2$.
  • Figure 3: Encoding process of the AIPA method with COS training.
  • Figure 4: Effect of the hyper-parameter $\alpha$ on Enc-CTC models trained on the LibriSpeech 100h dataset.
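The captions above compare $\alpha=2.0$ with $\alpha=0.2$. Assuming, as in standard mixup, that $\alpha$ is the concentration of a Beta($\alpha$, $\alpha$) distribution from which the mixing coefficient is drawn (this excerpt does not state that), the following sketch shows how the two settings differ. The helper `mixing_strength` is hypothetical: it estimates the average weight given to the minor sample in each pair, so a value near 0.5 means heavy mixing and a value near 0 means the interpolated sample stays close to one of the originals.

```python
import numpy as np


def mixing_strength(alpha, n=100_000, seed=0):
    """Monte Carlo estimate of E[min(lam, 1 - lam)] for lam ~ Beta(alpha, alpha).

    Assumption: alpha plays the role of the Beta concentration parameter,
    as in standard mixup; the source excerpt does not confirm this.
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=n)
    # Weight of the less-dominant sample in each interpolated pair.
    return float(np.mean(np.minimum(lam, 1.0 - lam)))


strong = mixing_strength(2.0)  # mass concentrated around lam = 0.5
weak = mixing_strength(0.2)    # mass concentrated near lam = 0 and lam = 1
```

Under this assumption, the smaller $\alpha$ yields interpolated samples that lie closer to one of the two originals, which is consistent with the captions contrasting the two settings.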