DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module

Xinyu Wang, Haotian Jiang, Haolin Huang, Yu Fang, Mengjie Xu, Qian Wang

TL;DR

The paper tackles efficient audio-visual speech recognition by introducing an asymmetric architecture that treats audio as the primary modality while incorporating visual cues via a Dual Conformer Interaction Module (DCIM). It combines cross-modal adapters and a three-stage pre-training regime to enable lightweight yet effective fusion, achieving substantial parameter and WER reductions compared with baselines. Empirically, the approach delivers competitive WER with far fewer parameters (53M) than large AVSR models and demonstrates robustness to noise, underscoring its practicality for resource-constrained deployments. The work highlights the importance of both information completion and purification in cross-modal fusion and paves the way for efficient AVSR in real-world scenarios.

Abstract

Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. This technology is essential for applications such as virtual assistants, transcription services, and communication tools. The Audio-Visual Speech Recognition (AVSR) model enhances traditional speech recognition, particularly in noisy environments, by incorporating visual modalities like lip movements and facial expressions. While traditional AVSR models trained on large-scale datasets with numerous parameters can achieve remarkable accuracy, often surpassing human performance, they also come with high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the number of parameters through the integration of a Dual Conformer Interaction Module (DCIM). In addition, we propose a pre-training method that further optimizes model performance by selectively updating parameters, leading to significant improvements in efficiency. Unlike conventional models that require the system to independently learn the hierarchical relationship between audio and visual modalities, our approach incorporates this distinction directly into the model architecture. This design enhances both efficiency and performance, resulting in a more practical and effective solution for AVSR tasks.

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The overall architecture of DCIM-AVSR (Dual-Mode Conformer Interaction Model for Audio-Visual Speech Recognition) and of the adapter module, illustrating the mechanism and flow of cross-modal information interaction.
  • Figure 2: The detailed comparison of the different variants of adapter, focusing on their functionality.
  • Figure 3: WER (%) Comparison on LRS2/LRS3 Under Various Conditions.
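
For readers unfamiliar with the metric reported in Figure 3 and in the TL;DR, WER (word error rate) is conventionally the word-level edit distance between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the paper's own evaluation script and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption, not the paper's pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.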