Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Rui Liu, Hongyu Yuan, Haizhou Li

TL;DR

This work introduces AVGER, a Generative Error Correction (GER) framework for Audio-Visual Speech Recognition (AVSR) that combines an independent Multimodal Synchronous Encoder with a Cross-modal Prompt, so that an LLM can correct transcriptions using both audio and lip-video information. By segmenting temporally aligned audio-visual features and guiding the LLM with a structured prompt, AVGER learns a hypotheses-to-transcription (H2T) mapping while aligning multimodal representations through a Multi-Level Consistency Constraint training objective comprising CMD, WER, and CE losses. Experiments on the LRS3 dataset show AVGER achieving substantial WER reductions (an average of about $11.8\%$ WER and a WER reduction, WERR, of up to $24\%$) across noisy conditions, outperforming the GER and UADF baselines. The results demonstrate the practical potential of multimodal error correction for AVSR, with implications for robust speech understanding in real-world noisy environments and for future extension to more languages and noise types.
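To make the two metrics quoted above concrete, the following is a minimal sketch of how WER and WERR are commonly computed. It assumes the usual definitions (word-level edit distance divided by the reference length, and relative reduction over a baseline WER); the function names and the example sentences and numbers are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def werr(baseline_wer, corrected_wer):
    """Relative WER reduction of a corrected system over a baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer


# Illustrative values only: one substitution and one deletion over six reference words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# Illustrative values only: a baseline WER of 20% corrected down to 15%.
example_werr = werr(0.20, 0.15)
```

Under these definitions, a WERR figure describes an improvement relative to the baseline system rather than an absolute drop in percentage points.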

Abstract

Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs cannot understand audio and visual signals simultaneously, which makes the GER approach challenging to apply to AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals and produce the N-best hypotheses, and then use a Q-Former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into compressed audio and video representations, respectively, that the LLM can understand. The compressed audio-visual representations and the N-best hypotheses together then constitute a Cross-modal Prompt that guides the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits level, the utterance level, and the representation level, to improve correction accuracy while enhancing the interpretability of the compressed audio and visual representations. Experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems, reducing the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.
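As a rough illustration of how compressed audio-visual representations and N-best hypotheses might be combined into one prompt, the sketch below builds a text template with placeholder slots. The layout, the `<audio>` and `<video>` placeholder tokens, the wording of the prompt, and the function name are all assumptions made for illustration; the abstract does not specify the prompt format. In a real system, such slots would be replaced by the encoder's output embeddings before being passed to the LLM.

```python
AUDIO_SLOT = "<audio>"  # hypothetical placeholder for one compressed audio token
VIDEO_SLOT = "<video>"  # hypothetical placeholder for one compressed video token


def build_cross_modal_prompt(hypotheses, n_audio, n_video):
    """Assemble a prompt from placeholder slots and ranked N-best hypotheses.

    This layout is an assumption for illustration, not the paper's template.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    lines = [
        "Audio representation: " + " ".join([AUDIO_SLOT] * n_audio),
        "Video representation: " + " ".join([VIDEO_SLOT] * n_video),
        "N-best hypotheses:",
    ]
    # Number the hypotheses in the order the AVSR system ranked them.
    lines += [f"{rank}. {text}" for rank, text in enumerate(hypotheses, start=1)]
    lines.append("Best transcription:")
    return "\n".join(lines)


# Illustrative hypotheses only; they are not drawn from LRS3.
prompt = build_cross_modal_prompt(
    ["i sea the light", "i see the light", "eye see the light"],
    n_audio=4,
    n_video=4,
)
```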

Paper Structure

This paper contains 27 sections, 13 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Architecture of the AVGER System. The AVGER system integrates speech and lip video inputs for improved transcription accuracy. It consists of: 1) Audio-Visual Speech Recognition; 2) Audio-Visual Temporal Synchronisation; 3) Cross-modal Prompt; and 4) Multi-Level Consistency Constraint Training.
  • Figure 2: The workflow of the Q-Former. $e_k$ is a clip of segmented frame-level features, $P_k$ denotes the segment-level position embeddings, and $\oplus$ denotes vector addition.
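
The segmentation step named in the Figure 2 caption can be sketched as follows: frame-level features are split into clips $e_k$, and a segment-level position embedding $P_k$ is added to each clip. This is a minimal sketch under stated assumptions: the fixed segment length, the choice to add the same $P_k$ vector to every frame of clip $k$, and the function name are hypothetical, and the Q-Former itself is not implemented here.

```python
def segment_with_positions(frames, segment_len, position_embeddings):
    """Split frame-level features into clips e_k and add P_k to each frame of clip k.

    frames: list of feature vectors (lists of floats), one per frame.
    segment_len: number of frames per clip (hypothetical fixed value).
    position_embeddings: one vector P_k per clip, same width as a frame.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    clips = [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]
    if len(clips) > len(position_embeddings):
        raise ValueError("not enough position embeddings for the number of clips")
    positioned = []
    for e_k, p_k in zip(clips, position_embeddings):
        # Element-wise vector addition, applied to every frame in the clip.
        positioned.append([[x + p for x, p in zip(frame, p_k)] for frame in e_k])
    return positioned


# Illustrative toy features: three 2-dimensional frames, clips of two frames each.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_positions = [[10.0, 10.0], [20.0, 20.0]]
toy_clips = segment_with_positions(toy_frames, 2, toy_positions)
```

In this toy setup the last clip is shorter than the others; how a real implementation pads or drops such a remainder is not stated in the caption.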