Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition
Seaone Ok, Min Jun Choi, Eungbeom Kim, Seungu Han, Kyogu Lee
TL;DR
This work tackles noise-robust audio-visual speech recognition (AVSR) by introducing CoBRA, a cross-modal bottleneck fusion framework that mediates modality exchange through a compact set of learnable tokens. The audio and visual encoders interact exclusively via these bottleneck tokens, which lets the audio stream draw on visual cues even in adverse conditions; ablations show that fusion depth and bottleneck size both critically affect performance. Empirically, CoBRA achieves strong results on LRS3 and LRS2 with limited training data, including notable improvements under noisy conditions, and attention rollout analyses show that its reliance on the visual stream adapts to the noise level. The approach offers data-efficient robustness for AVSR and suggests avenues for scaling with larger pretraining and additional modalities.
Abstract
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems.
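To make the idea of token-mediated exchange concrete, the following is a minimal NumPy sketch of one generic bottleneck fusion layer. It is an illustration under stated assumptions, not CoBRA's actual architecture: it assumes single-head attention without layer normalization or feed-forward blocks, assumes each stream self-attends over its own tokens concatenated with the shared bottleneck tokens, and assumes the two resulting bottleneck updates are simply averaged. All function and parameter names (`bottleneck_fusion_layer`, `params_a`, `params_v`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over the rows of x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bottleneck_fusion_layer(audio, visual, bottleneck, params_a, params_v):
    """One fusion layer (assumed design, not taken from the paper).

    audio:      (T_a, d) audio tokens
    visual:     (T_v, d) visual tokens
    bottleneck: (B, d)   shared bottleneck tokens
    params_a, params_v: (w_q, w_k, w_v) tuples, each matrix of shape (d, d)
    """
    n_a, n_v = audio.shape[0], visual.shape[0]

    # Each stream attends only over its own tokens plus the bottleneck,
    # so audio and visual tokens never attend to each other directly.
    a_in = np.concatenate([audio, bottleneck], axis=0)
    v_in = np.concatenate([visual, bottleneck], axis=0)
    a_out = a_in + self_attention(a_in, *params_a)  # residual connection
    v_out = v_in + self_attention(v_in, *params_v)

    new_audio, b_from_audio = a_out[:n_a], a_out[n_a:]
    new_visual, b_from_visual = v_out[:n_v], v_out[n_v:]

    # Assumption: merge the two per-modality bottleneck updates by averaging.
    new_bottleneck = 0.5 * (b_from_audio + b_from_visual)
    return new_audio, new_visual, new_bottleneck

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    audio = rng.normal(size=(10, d))
    visual = rng.normal(size=(6, d))
    bottleneck = rng.normal(size=(4, d))
    params_a = tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3))
    params_v = tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3))

    # Stack two layers, reusing the same weights for brevity.
    for _ in range(2):
        audio, visual, bottleneck = bottleneck_fusion_layer(
            audio, visual, bottleneck, params_a, params_v
        )
    print(audio.shape, visual.shape, bottleneck.shape)
```

In this sketch, visual information reaches the audio tokens only after it has been written into the bottleneck by one layer and read back by the next, which gives some intuition for why the number of fusion layers could matter as much as the abstract reports.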
