Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

Fei Su; Cancan Li; Juan Liu; Wei Ju; Hongbin Suo; Ming Li


TL;DR

This work proposes AVUR-LLM, an LLM-based audio-visual speech recognition framework that combines Sparse Modality Alignment and Visual Unit-Guided Refinement.

Abstract

Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based AVSR framework built on Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, AVUR-LLM achieves a 37% relative improvement over the baseline system.
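To make the "relative improvement" figure concrete, the sketch below shows how such a number is typically computed. It assumes the metric is word error rate (WER), which the abstract does not state explicitly; the function name and the input values are hypothetical and chosen only to illustrate the arithmetic, not taken from the paper.

```python
def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical WER values (in %), picked only so the arithmetic is easy to follow.
# They are NOT results reported in the paper.
example = relative_improvement(baseline_wer=10.0, system_wer=6.3)
print(f"{example:.0%}")
```

Under this reading, a relative figure says nothing about absolute WER on its own: the same 37% could come from a drop of 10.0 to 6.3 or of 2.0 to 1.26.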

Paper Structure (15 sections, 5 equations, 2 figures, 3 tables)

Introduction
Methodology
Sparse Modality Alignment
Adaptive Modulated Fusion
Visual Unit-Guided Refinement
Experimental Results
Experimental Settings
Training and Inference Details
Experimental Results and Analysis
Main Results
Noise Robustness Evaluation
Ablation Results
Further Analysis
Conclusion
Generative AI Use Disclosure

Figures (2)

Figure 1: Illustration of AVUR-LLM. (a) The Sparse Modality Alignment (SMA) module: audio-conditioned cross-attention sparsely inserted into the upper audio encoder layers. (b) The Adaptive Modulated Fusion (AMF) module: confidence-aware gating that adaptively modulates visual injection in the decoder. (c) The Visual Unit-Guided Refinement (VUR) module, which discretizes mid-layer visual features into tokens for LLM rescoring. Training has two stages: Stage 1 performs SMA+AMF and outputs $N$-best hypotheses; Stage 2 extracts $\mathbf{X}_{v}^{\ell}$, compresses it into visual discrete tokens, and uses a LoRA-adapted LLM to rescore the hypotheses. Here $\mathbf{X}_{a}$ and $\mathbf{X}_{v}$ denote the audio and visual encoder features, $\hat{\mathbf{X}}_{v}$ the SMA-refined visual features, $\mathbf{X}_{v}^{\ell}$ the visual features extracted from encoder layer $\ell$, $\mathbf{X}_{a}^{\mathrm{sg}}$ the audio features that serve as keys/values under stop-gradient, and $g_{\mathrm{AMF}}$ the AMF gate coefficients.
Figure 2: Effect of visual discrete token extraction depth and codebook size on WER (%). Solid lines denote clean conditions; dashed lines denote 0 dB SNR. Blue curves denote a K-means codebook size of $K{=}1000$; red curves denote $K{=}2000$.
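The captions describe mapping mid-layer visual features to discrete tokens with a K-means codebook of size $K$. A minimal sketch of that idea follows, assuming the codebook centroids have already been fitted offline and that each frame is assigned to its nearest centroid. The captions do not say how the tokens are compressed; collapsing runs of identical consecutive units, as done here, is an assumption borrowed from common discrete-unit pipelines. All function names, the toy codebook, and the toy frames are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook centroid."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units (assumed compression step)."""
    return [unit for unit, _ in groupby(units)]


# Toy codebook with K=3 centroids in 2-D; the figure uses K=1000 or K=2000
# over much higher-dimensional encoder features.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (3.8, 4.1), (4.2, 3.9)]

units = quantize(frames, codebook)   # one unit index per frame
tokens = collapse_repeats(units)     # shorter sequence passed on for rescoring
print(units, tokens)
```

In this sketch a larger $K$ gives a finer partition of the feature space at the cost of a larger token vocabulary, which is the trade-off Figure 2 explores alongside the extraction layer.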
