XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

TL;DR

The paper addresses noise-robust speech perception across many languages by leveraging cross-lingual audio-visual signals. It introduces XLAVS-R, a cross-lingual AV SSL model built on scalable audio-only pre-training with visual modality injection, a learnable audio feature extractor, and a single-round AV training scheme. Empirically, XLAVS-R achieves state-of-the-art results on MuAViC for AVSR and AVS2TT, exhibits strong noise robustness, and demonstrates notable zero-shot AV transfer when downstream fine-tuning uses audio-only data. The work shows that multilingual audio-only data, when complemented with selective visual integration and efficient training, yields broad language coverage and robustness with reduced reliance on labeled AV data, while scaling AV pre-training data enhances domain adaptation.

Abstract

Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.

Paper Structure (18 sections, 1 equation, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Overview of XLAVS-R. From the audio-only SSL model, we generate unit targets for audio-visual SSL pre-training (left box). We inject the visual modality into the first-stage model (blue blocks) and fuse it with the audio modality (pink blocks) to continue training with audio-visual SSL pre-training. In AV SSL, noise is randomly added to clean audio, and masked prediction objectives (top right) are applied to the union of the masked frames of the audio and visual streams (bottom, gray).
  • Figure 2: Effectiveness of each XLAVS-R component and of multilingual pre-training data, starting from an AV-HuBERT model pre-trained only on MuAViC-En. All components of XLAVS-R are shown to be effective. Each ablated pre-trained model is fine-tuned and evaluated on multilingual audio-visual speech recognition with identical training and test settings (A: audio, AV: audio+video). The numbers behind the plots are in the Appendix, Table \ref{table:avhubert2}.
  • Figure 3: XLAVS-R shows greater zero-shot ability in the audio-visual test mode with an audio-only fine-tuned (FT, striped) model than AV-HuBERT does. Without being fine-tuned on the labeled audio-visual set, the A-only FT model from XLAVS-R shows a 5% WER improvement in AV test mode over A test mode (purple), while that of AV-HuBERT shows only 3% WER (orange) in a noisy environment. The larger the XLAVS-R model, the greater the zero-shot ability: the A-only FT 2B model shows a gap of 9.8% WER between A and AV test modes in a noisy environment. Values for individual languages are in the Appendix, Table \ref{table:xlsr_xlavsr}.
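
The AV SSL step described in the Figure 1 caption (noise randomly added to clean audio, and a masked prediction loss over the union of the audio and visual masks) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the SNR-based mixing rule, the toy masks, and the plain cross-entropy over unit targets are all assumptions made for illustration, and the paper's actual mask spans, noise sources, and loss weighting are not given in this section.

```python
import math
import random


def mix_noise(clean, noise, snr_db):
    """Add noise to clean audio at a target signal-to-noise ratio in dB.

    Assumption: noise is rescaled so that
    clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    """
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


def union_mask(audio_mask, video_mask):
    """A frame counts as masked if it is masked in either stream."""
    return [a or v for a, v in zip(audio_mask, video_mask)]


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy against unit targets, over masked frames only."""
    losses = []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        top = max(frame_logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in frame_logits))
        losses.append(log_norm - frame_logits[target])
    # With no masked frames there is nothing to predict.
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical toy inputs: 160 audio samples, 4 frames, 5 target units.
random.seed(0)
clean = [random.gauss(0, 1) for _ in range(160)]
noise = [random.gauss(0, 1) for _ in range(160)]
noisy = mix_noise(clean, noise, snr_db=0.0)

audio_mask = [True, False, False, True]
video_mask = [False, False, True, True]
mask = union_mask(audio_mask, video_mask)

logits = [[0.0] * 5 for _ in range(4)]  # stand-in for model outputs
targets = [0, 1, 2, 3]                  # stand-in for unit targets
loss = masked_prediction_loss(logits, targets, mask)
```

In this toy setup the second frame is masked in neither stream, so it contributes nothing to the loss; the other three frames are each masked in at least one stream and are all scored.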