Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, Shinji Watanabe

TL;DR

This work tackles audiovisual speech recognition (AV-ASR) in unconstrained real-world videos, where noisy audio, spontaneous speech, and the uncertain usefulness of visual information hinder accuracy. It introduces Bifocal Preference Optimization (BPO-AVASR), a training framework that uses both input-side (audio, video) and output-side (transcript) preferences to steer AV-ASR models away from the errors they commonly make on real-world video. The method constructs a bifocal preference dataset and trains in two stages: supervised fine-tuning followed by preference-based optimization. It reports state-of-the-art performance on the How2, VisSpeech, and Ego4D datasets, often with less audiovisual data than prior approaches. Detailed ablations and qualitative analyses support the results, which generalize across diverse domains and suggest a practical path toward more reliable AV-ASR in real-world video applications.

Abstract

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous work fine-tunes audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, such approaches often neglect visual features and the errors that are common in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data by simulating common AV-ASR errors from two focal points: manipulating the audio or visual input, and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method that improves AV-ASR models by leveraging both input-side and output-side preferences. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.
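
The abstract does not state the training objective, but the Figure 3 caption compares against a baseline "w/o DPO", which suggests a Direct Preference Optimization style loss. The following is a minimal sketch of the standard DPO loss for a single preference pair, not the authors' implementation. The function names, the `beta` value, and all log-probabilities in the usage lines are illustrative assumptions; how the paper weights or combines input-side and output-side pairs is not specified here.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)), computed as -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not give one
) -> float:
    """Standard DPO loss for one (chosen, rejected) transcript pair.

    Each argument is the summed token log-probability of a transcript
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen transcript is favored over the rejected one.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


# Hypothetical log-probabilities for two kinds of pair (illustrative only).
# Output-side pair: rejected transcript is a rewritten ground-truth transcript.
output_side = dpo_loss(-12.0, -20.0, -14.0, -15.0)
# Input-side pair: rejected transcript is tied to manipulated audio or video.
input_side = dpo_loss(-12.0, -18.0, -14.0, -14.5)
print(output_side, input_side)
```

When the policy and reference agree on both transcripts, the margin is zero and the loss equals log 2. Training pushes the loss below that value by raising the relative likelihood of the preferred transcript.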

Paper Structure

This paper contains 28 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of our proposed framework. We construct bifocal preferences by augmenting the transcript (T), audio (A), or video (V), resulting in two types of preference pairs: input-side (A and V) and output-side (T). Through bifocal preference optimization, BPO-AVASR learns to generate transcripts that are better aligned with these preferences.
  • Figure 2: Illustration of the preference dataset construction strategies: (a) Input-side preference construction: simulates errors by manipulating the input audio and video; (b) Output-side preference construction: simulates common errors by manipulating ground-truth transcripts.
  • Figure 3: Qualitative results. We show the ground-truth text (GT) and predictions from OWSM-visual small (w/o DPO) and BPO-AVASR small. We show the enhanced performance in three scenarios: when vision provides content cues (top), when vision offers context clues (middle), and when speech is spontaneous (bottom). Errors in the predicted words compared to the GT are highlighted in red. Faces are blurred for privacy.
  • Figure 4: Examples of generated dense captions using ShareCaptioner-Video.