Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

Fei Su, Cancan Li, Juan Liu, Wei Ju, Hongbin Suo, Ming Li

TL;DR

This work proposes AVUR-LLM, an LLM-based Audio-Visual Speech Recognition framework built on Sparse Modality Alignment and Visual Unit-Guided Refinement.

Abstract

Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches either project audio and visual features independently or apply shallow fusion, which limits cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based AVSR framework that combines Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, AVUR-LLM achieves a 37% relative improvement over the baseline system.
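
To make the evaluation terms in the abstract concrete, the sketch below shows one common way to build an additive-noise test condition at a target SNR and to compute a relative improvement. It is an illustration only, not the authors' evaluation code: the function names `mix_at_snr` and `relative_improvement` are hypothetical, the metric is assumed to be word error rate (WER), and the WER values in the usage note are made-up placeholders rather than results from the paper.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the speech-to-noise power ratio is `snr_db` dB."""
    speech = np.asarray(speech, dtype=np.float64)
    # np.resize repeats or truncates the noise so it matches the speech length.
    noise = np.resize(np.asarray(noise, dtype=np.float64), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the scale so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def relative_improvement(baseline_wer, system_wer):
    """Relative error reduction with respect to the baseline (assumed metric: WER)."""
    return (baseline_wer - system_wer) / baseline_wer
```

At 0 dB SNR the scaled noise carries the same average power as the speech. With placeholder values, `relative_improvement(10.0, 6.3)` corresponds to a 37% relative reduction, which is how a figure such as the one in the abstract is typically computed.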

Paper Structure (15 sections, 5 equations, 2 figures, 3 tables)

This paper contains 15 sections, 5 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Illustration of AVUR-LLM. (a) The Sparse Modality Alignment (SMA) module: audio-conditioned cross-attention sparsely inserted into the upper audio encoder layers. (b) The Adaptive Modulated Fusion (AMF) module: confidence-aware gating that adaptively modulates visual injection in the decoder. (c) The Visual Unit-Guided Refinement (VUR) module, which discretizes mid-layer visual features into tokens for LLM rescoring. Training has two stages: Stage 1 applies SMA and AMF and outputs $N$-best hypotheses; Stage 2 extracts and compresses $\mathbf{X}_{v}^{\ell}$ into visual discrete tokens and uses a LoRA-adapted LLM to rescore the hypotheses. Here, $\mathbf{X}_{a}$ and $\mathbf{X}_{v}$ denote the audio and visual encoder features, $\hat{\mathbf{X}}_{v}$ the SMA-refined visual features, $\mathbf{X}_{v}^{\ell}$ the visual features extracted from encoder layer $\ell$, $\mathbf{X}_{a}^{\mathrm{sg}}$ the audio features serving as keys/values with stop-gradient, and $g_{\mathrm{AMF}}$ the AMF gate coefficients.
  • Figure 2: Effect of visual discrete token extraction depth and codebook size on WER (%). Solid lines denote clean conditions; dashed lines denote 0 dB SNR. Blue curves denote K-means codebook size $K{=}1000$; red curves denote $K{=}2000$.
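
The captions above describe turning mid-layer visual features into discrete tokens with a K-means codebook of size $K$. The sketch below illustrates that general idea with plain NumPy. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`fit_kmeans`, `to_units`, `collapse_repeats`) are hypothetical, the features are assumed to be a `(T, D)` array of frame vectors, and merging consecutive repeated units is only one plausible reading of "compresses", since the excerpt does not specify the compression scheme.

```python
import numpy as np


def fit_kmeans(features, k, iters=20, seed=0):
    """Fit a K-means codebook (Lloyd's algorithm) on frame features of shape (T, D)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames; fancy indexing returns a copy.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        assign = to_units(features, centroids)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


def to_units(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete unit)."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def collapse_repeats(units):
    """Merge runs of identical consecutive units (assumed compression step)."""
    units = np.asarray(units)
    if units.size == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]
```

For example, `collapse_repeats(to_units(feats, fit_kmeans(feats, k)))` yields a shorter unit sequence for one utterance. In a rescoring setup like the one in Figure 1, such a sequence would accompany the $N$-best hypotheses given to the LLM; how the tokens are actually formatted for the LLM is not stated in this excerpt.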