Table of Contents
Fetching ...

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Pol Buitrago, Pol Gàlvez, Oriol Pareras, Javier Hernando

TL;DR

A zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio, which achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise.

Abstract

Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

TL;DR

A zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio, which achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise.

Abstract

Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Paper Structure (13 sections, 3 figures, 4 tables)

This paper contains 13 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Pipeline for generating synthetic audiovisual data from audio-only corpora and facial images. Face images are selected and animated with lip movements synchronized to the speech using a pre-trained Wav2Lip+GAN model 10.1145/3394171.3413532gan, producing coherent talking-head videos.
  • Figure 2: Semi-automatic pipeline for annotating the Catalan AV benchmark. The process combines segment extraction, morphological filtering, automatic pseudo-labeling radford2022robustspeechrecognitionlargescale, and manual verification and refinement via a custom graphical interface.
  • Figure 3: WER (%) comparison under additive white Gaussian noise at different SNR levels.