Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
Ahmed Adel Attia, Jing Liu, Carol Espy-Wilson
TL;DR
Articulation-Informed ASR tackles data scarcity by reintroducing articulatory features into pre-trained transformer ASR through an auxiliary speech inversion (SI) task and a cross-attention fusion module. The model extends Wav2Vec2.0 with an MAE-based SI head and a cross-attention block that uses the predicted tract variables (TVs) as queries, and it is trained jointly with CTC through an uncertainty-weighted loss $\mathcal{L}_{t}$. Experiments on LibriSpeech show consistent WER reductions, particularly under low-resource and noisy settings, and the base model can approach the performance of larger models with far fewer parameters. This work demonstrates that articulatory representations can improve ASR performance in modern architectures and suggests directions for scaling to larger systems such as Whisper, with potential benefits in robustness and reduced hallucinations.
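The summary above does not spell out the form of $\mathcal{L}_{t}$. As a sketch only, assuming the common homoscedastic-uncertainty weighting with one learnable log-variance per task, a joint objective of this kind could be written as follows. The symbols $s_{\mathrm{CTC}}$, $s_{\mathrm{SI}}$, $\mathcal{L}_{\mathrm{CTC}}$, and $\mathcal{L}_{\mathrm{SI}}$ are introduced here for illustration and are not taken from the paper.

$$
\mathcal{L}_{t} = e^{-s_{\mathrm{CTC}}}\,\mathcal{L}_{\mathrm{CTC}} + e^{-s_{\mathrm{SI}}}\,\mathcal{L}_{\mathrm{SI}} + s_{\mathrm{CTC}} + s_{\mathrm{SI}}
$$

Under this assumed form, a task with high learned uncertainty is down-weighted, while the additive $s$ terms keep the model from inflating both uncertainties without bound. The paper's exact weighting may differ.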
Abstract
Prior work has investigated articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
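The fusion step described in the abstract, with predicted articulatory features as queries and acoustic embeddings as keys and values, can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the function name `cross_attention_fusion`, the projection matrices, and every dimension (including the number of articulatory features) are hypothetical, and details such as multiple heads, residual connections, and normalization are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(artic, acoustic, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative only).

    artic:    (T, n_artic) predicted articulatory features, used as queries
    acoustic: (T, d_model) acoustic embeddings, used as keys and values
    Returns the fused sequence (T, d_attn) and the attention weights (T, T).
    """
    q = artic @ w_q      # queries from the articulatory stream
    k = acoustic @ w_k   # keys from the acoustic stream
    v = acoustic @ w_v   # values from the acoustic stream
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each query frame attends over all acoustic frames
    return weights @ v, weights

# Toy sizes chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, n_artic, d_model, d_attn = 5, 6, 16, 8
artic = rng.standard_normal((T, n_artic))
acoustic = rng.standard_normal((T, d_model))
w_q = rng.standard_normal((n_artic, d_attn)) * 0.1
w_k = rng.standard_normal((d_model, d_attn)) * 0.1
w_v = rng.standard_normal((d_model, d_attn)) * 0.1

fused, weights = cross_attention_fusion(artic, acoustic, w_q, w_k, w_v)
print(fused.shape)  # (5, 8)
```

In this sketch, using the articulatory stream as the query means each output frame is a mixture of acoustic values selected according to the predicted articulation. How the fused output feeds the CTC head is not specified in the abstract and is not modeled here.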
