Table of Contents
Fetching ...

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

Wenjie Tian, Mingchen Shao, Bingshen Mu, Xuelong Geng, Chengyou Wang, Yujie Liao, Zhixian Zhao, Ziyu Zhang, Jingbin Hu, Mengqi Wei, Lei Xie

TL;DR

This work constructs an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence, which mitigates the single-modality dominance problem and addresses the data scarcity.

Abstract

Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking rich context present in the video such as speaking scene and on-screen text. To tackle such CAVSR (AVSR including rich visual Context), we propose VASR designed to "see" and reason the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. Besides, to address the data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates the single-modality dominance, achieving state-of-the-art performance in CAVSR. The project is open-sourced.

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

TL;DR

This work constructs an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence, which mitigates the single-modality dominance problem and addresses the data scarcity.

Abstract

Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking rich context present in the video such as speaking scene and on-screen text. To tackle such CAVSR (AVSR including rich visual Context), we propose VASR designed to "see" and reason the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. Besides, to address the data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates the single-modality dominance, achieving state-of-the-art performance in CAVSR. The project is open-sourced.
Paper Structure (16 sections, 5 equations, 2 figures, 3 tables)

This paper contains 16 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of different recognition tasks: ASR, AVSR focusing on lip-reading, and CAVSR focusing on rich visual context. This sample illustrates that the CAVSR task can leverage rich visual context to enhance recognition accuracy.
  • Figure 2: Overview of the VASR framework. (Left) The model architecture achieves CAVSR. A qualitative example illustrates how the model resolves phonetic ambiguities via multimodal reasoning. (Right) The data process pipeline and construction of AV-CoT.