Revisiting Interpolation Augmentation for Speech-to-Text Generation
Chen Xu, Jie Wang, Xiaoqian Liu, Qianqian Dong, Chunliang Zhang, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang
TL;DR
This work investigates interpolation augmentation (IPA) for speech-to-text (S2T) generation in data-limited settings. It proposes two IPA implementations, embedding interpolation in the decoder and encoder-only interpolation, plus a more robust append-based variant (AIPA) that preserves the original data distribution to stabilize learning. IPA improves generalization on both automatic speech recognition (ASR) and automatic speech translation (AST), and on both encoder–decoder and encoder–CTC architectures, particularly when it is combined with COS-driven training objectives and carefully balanced against SpecAugment. COS for CTC reduces WER, and the AST gains come from applying learning-objective controls to interpolated samples. Overall, the paper provides practical guidelines for deploying IPA in resource-constrained S2T scenarios and shows that it is effective across data scales, architectures, and tasks.
Abstract
Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
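As a rough illustration of what "interpolating inputs and labels" means, the sketch below convexly combines two toy training samples with a mixing weight drawn from a Beta distribution. It is a generic input/label interpolation under simplifying assumptions, not the paper's IPA or AIPA implementation: the function names (interpolate_pair, sample_lambda), the Beta(alpha, alpha) sampling, and the fixed-length feature vectors and label distributions are all hypothetical choices made for the example.

```python
import random


def sample_lambda(alpha=0.5, rng=random):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha) (assumed choice)."""
    return rng.betavariate(alpha, alpha)


def interpolate_pair(x_a, x_b, y_a, y_b, lam):
    """Build one virtual sample from two real ones.

    x_a, x_b: equal-length feature vectors (toy stand-ins for speech features).
    y_a, y_b: equal-length label distributions (e.g. one-hot vectors).
    lam:      weight on sample A; sample B gets weight (1 - lam).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired inputs and paired labels must have equal length")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


if __name__ == "__main__":
    lam = sample_lambda(alpha=0.5, rng=random.Random(0))
    x_mix, y_mix = interpolate_pair([0.0, 2.0], [4.0, 6.0], [1.0, 0.0], [0.0, 1.0], lam)
    print(lam, x_mix, y_mix)
```

Real S2T batches contain variable-length feature and token sequences, so a practical version would also need padding or length handling, which this sketch leaves out.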
