Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun

TL;DR

CAV2vec tackles robust audio-visual speech recognition under real-world joint corruption by introducing unimodal multi-task corrupted prediction within a self-distillation framework. It combines the unimodal ACP and VCP tasks with an audio-visual corrupted prediction (AVCP) task and a masked-prediction component to align corrupted unimodal representations with clean cross-modal targets. Empirical results on LRS3 and LRS2 with unseen visual and audio corruptions, including DEMAND noise environments and pixelation, show significant improvements over strong baselines and indicate that the method generalizes beyond the corruptions seen in training. The method remains resource-efficient because it uptrains pretrained AV-HuBERT backbones with modest overhead.

Abstract

Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework designed specifically to handle joint audio-visual corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, in which the student model learns to predict clean targets, generated by the teacher model, from corrupted input frames. Specifically, we propose a unimodal multi-task learning strategy that distills cross-modal knowledge and aligns the corrupted modalities by predicting clean audio targets from corrupted video and clean video targets from corrupted audio. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
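
The cross-modal pairing described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the scalar "encoders", the Gaussian-noise corruption, the restriction of the loss to corrupted frames, the exponential-moving-average teacher update, and all function names (`encode`, `corrupt`, `corrupted_prediction_loss`, `unimodal_multitask_loss`, `ema_update`) are assumptions made here for illustration. Only the pairing itself (corrupted video predicts clean audio targets, corrupted audio predicts clean video targets, with targets produced by a teacher on clean input) comes from the text.

```python
import random


def encode(weight, frames):
    # Toy stand-in for an encoder: one scalar feature per frame, scaled by a weight.
    return [weight * x for x in frames]


def corrupt(frames, corrupted_idx, rng):
    # Overwrite the chosen frames with Gaussian noise (hypothetical corruption).
    return [rng.gauss(0.0, 1.0) if i in corrupted_idx else x
            for i, x in enumerate(frames)]


def corrupted_prediction_loss(prediction, target, corrupted_idx):
    # Mean squared error over the corrupted frames only (an assumption here).
    idx = sorted(corrupted_idx)
    return sum((prediction[i] - target[i]) ** 2 for i in idx) / len(idx)


def unimodal_multitask_loss(student, teacher, audio, video, corrupted_idx, rng):
    # The teacher sees clean inputs and yields one target sequence per modality.
    audio_target = encode(teacher["audio"], audio)
    video_target = encode(teacher["video"], video)
    # The student sees corrupted inputs, one modality at a time.
    audio_pred = encode(student["audio"], corrupt(audio, corrupted_idx, rng))
    video_pred = encode(student["video"], corrupt(video, corrupted_idx, rng))
    # Cross-modal pairing: corrupted video -> clean audio target,
    # corrupted audio -> clean video target.
    loss_video_to_audio = corrupted_prediction_loss(video_pred, audio_target, corrupted_idx)
    loss_audio_to_video = corrupted_prediction_loss(audio_pred, video_target, corrupted_idx)
    return loss_video_to_audio + loss_audio_to_video


def ema_update(teacher, student, decay=0.999):
    # Assumed teacher update: exponential moving average of the student weights.
    for name in teacher:
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student[name]


rng = random.Random(0)
audio = [0.2, 0.4, 0.6, 0.8]
video = [0.1, 0.3, 0.5, 0.7]
student = {"audio": 1.0, "video": 1.0}
teacher = dict(student)
loss = unimodal_multitask_loss(student, teacher, audio, video, {1, 2}, rng)
ema_update(teacher, student)
```

In a real system the scalar weights would be replaced by the audio and video branches of a pretrained backbone, and the gradient of `loss` would update only the student; the sketch stops at computing the loss and refreshing the teacher.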

Paper Structure

This paper contains 40 sections, 5 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: (a) Real-world speech recognition challenges. AVSR models struggle to maintain robust representations in corrupted environments and fail to recognize utterances. (b) Our corrupted representation learning strategies with multimodal and unimodal corrupted prediction tasks. (c) Speech recognition accuracy (100 $-$ WER, in %), where frequency denotes the number of visual corruption events in a sequence. Our representation learning framework, CAV2vec with a unimodal strategy (U), significantly improves robustness compared to the baseline model and even outperforms the multimodal strategy (M).
  • Figure 2: The visual and audio corruption types we use in our training and evaluation phases. Unseen corruption types are used only in evaluation to assess the model's generalizability. For the speech noise drawn from LRS3, we ensure that there is no speaker overlap between the training and evaluation sets.
  • Figure 3: Overview of our representation learning framework with corrupted prediction tasks. For the corrupted prediction strategies, focusing on the cross-modal alignment through unimodal multi-task learning proves highly effective in gaining multimodal robustness.
  • Figure 4: Similarity scores measured between audio-visual features of sample sequences. Clean sequence representations are compared with corrupted ones from (a) AV-data2vec and (b) CAV2vec. The normalized L2 distance $\overline{d}$ between the clean and corrupted features is calculated per sample.
  • Figure 5: Our implemented strategies for corrupted prediction tasks. The AVCP task uses the audio-visual targets. For the multi-task learning (MTL) designs that utilize unimodal targets, mACP and mVCP tasks use multimodal inputs (mMTL), while ACP and VCP tasks use unimodal inputs (uMTL).
  • ...and 1 more figure