Tailored Design of Audio-Visual Speech Recognition Models using Branchformers

David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos

TL;DR

The paper tackles parameter-efficient audio-visual speech recognition (AVSR) by using the interpretable branch scores of modality-specific Branchformer models to tailor a unified cross-modal encoder. It proposes a two-step process: first, train audio-only and video-only Branchformer encoders; then, design a tailored AVSR encoder by selecting global or local processing for each layer according to the modality-specific branch scores. Training uses a hybrid CTC/Attention objective, $\mathcal{L} = \alpha \log p_{ctc}(\mathbf{Y}|\mathbf{X}) + (1-\alpha) \log p_{attn}(\mathbf{Y}|\mathbf{X})$, where $\alpha$ weights the CTC term against the attention-decoder term. On English benchmarks the model achieves about 2.5% WER, and on Spanish benchmarks an average of around 9.1% WER, with approximately 59.3M parameters versus 103.5M for conventional architectures. The adaptive fusion indicates that acoustic cues dominate (~73%), and the tailored design is reported to be robust to noise and effective across both languages, offering practical guidance for building efficient AVSR systems.
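
To make the weighting in the hybrid objective concrete, the following minimal sketch combines two already-computed sequence log-probabilities. The function names and the example value of `alpha` are illustrative assumptions, not taken from the paper or its repository; in a real system the two log-probabilities would come from a CTC head and an attention decoder.

```python
import math


def hybrid_objective(log_p_ctc: float, log_p_attn: float, alpha: float) -> float:
    """Return alpha * log p_ctc(Y|X) + (1 - alpha) * log p_attn(Y|X).

    Hypothetical helper: both inputs are log-probabilities of the same
    target sequence Y given the same input X.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * log_p_ctc + (1.0 - alpha) * log_p_attn


def hybrid_loss(log_p_ctc: float, log_p_attn: float, alpha: float) -> float:
    """Negated objective, so that minimizing it maximizes the likelihood."""
    return -hybrid_objective(log_p_ctc, log_p_attn, alpha)


if __name__ == "__main__":
    # Illustrative probabilities only; alpha = 0.3 is an arbitrary choice.
    loss = hybrid_loss(math.log(0.5), math.log(0.8), alpha=0.3)
    print(f"hybrid loss: {loss:.4f}")
```

Setting `alpha` to 0 reduces this to the attention term alone, and setting it to 1 to the CTC term alone, which is a quick sanity check on any implementation of the combination.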

Abstract

Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio-visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models with vast numbers of parameters and computationally expensive training processes. In this paper, we aim to bridge this research gap by introducing a novel audio-visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as the Branchformer, in the design of parameter-efficient AVSR systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks covering multiple data conditions and scenarios demonstrated the effectiveness of our proposed method. Even when trained on a moderate scale of data, our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish, establishing a new benchmark with an average WER of around 9.1%. These results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates while significantly reducing the model complexity w.r.t. the prevalent approach in the field. Code and pre-trained models are available at https://github.com/david-gimeno/tailored-avsr.
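
The second step, deriving a layer plan from layer-level branch scores, can be sketched as follows. This is only an illustration under stated assumptions: the selection rule (keep the global branch when its average weight reaches a threshold, otherwise the local branch), the function names, and the example weights are all hypothetical and are not the authors' procedure or measured values, which this summary does not specify.

```python
from typing import Dict, List


def select_branches(global_weights: List[float], threshold: float = 0.5) -> List[str]:
    """Map each layer's average global-branch weight to a branch label.

    Assumed rule: 'global' if the weight is at least `threshold`, else 'local'.
    """
    return ["global" if w >= threshold else "local" for w in global_weights]


def design_tailored_encoder(
    audio_global_weights: List[float],
    video_global_weights: List[float],
) -> List[Dict[str, object]]:
    """Build a per-layer plan from audio-only and video-only branch scores."""
    if len(audio_global_weights) != len(video_global_weights):
        raise ValueError("both models must have the same number of layers")
    audio = select_branches(audio_global_weights)
    video = select_branches(video_global_weights)
    return [
        {"layer": i, "audio": a, "video": v}
        for i, (a, v) in enumerate(zip(audio, video), start=1)
    ]


if __name__ == "__main__":
    # Made-up weights for a 3-layer toy encoder, for illustration only.
    for row in design_tailored_encoder([0.2, 0.6, 0.8], [0.7, 0.4, 0.9]):
        print(row)
```

A hard threshold is the simplest possible rule; it discards how close a score was to the boundary, so a real design might treat near-ties differently.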

Paper Structure

This paper contains 11 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Architecture of a Branchformer encoder layer (Peng et al., 2022).
  • Figure 2: Visualization of the average Branchformer weights on the LRS2 validation data set. (a) Acoustic Speech Recognition. (b) Visual Speech Recognition.
  • Figure 3: The proposed method preprocesses the visual and audio cues, conditions the resulting temporally aligned speech features with positional and modality embeddings, and models them with a tailored encoder whose architecture is based on the interpretable weights of models previously estimated for audio- and video-only settings. The final speech interpretation is performed autoregressively through the hybrid CTC/Attention paradigm. Both feed-forward networks share parameters across modalities. CE and CTC refer to Cross Entropy and Connectionist Temporal Classification, respectively.
  • Figure 4: Analysis of training with additive noise for both our proposed audio-only ASR and tailored AVSR architectures. Results are given in WER (%) under babble noise at multiple SNR levels across the English and Spanish test benchmarks. Shaded areas correspond to 95% confidence intervals. An asterisk (*) indicates experiments where training included babble-noise acoustic distortions at random, varying SNR levels.
  • Figure 5: Visualization of the average Branchformer weights on the MuAViC validation data set. (a) Acoustic Speech Recognition. (b) Visual Speech Recognition.
  • ...and 1 more figures