Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun
TL;DR
CAV2vec tackles robust audio-visual speech recognition under real-world joint audio-visual corruption by introducing unimodal multi-task corrupted prediction within a self-distillation framework. It combines audio corrupted prediction (ACP) and visual corrupted prediction (VCP) tasks with audio-visual corrupted prediction (AVCP) and masked prediction to align corrupted unimodal representations with clean cross-modal targets. Experiments on LRS3 and LRS2 with unseen visual and audio corruptions, including DEMAND noise environments and pixelation, show significant improvements over strong baselines and indicate that the method generalizes well. The method remains resource-efficient by uptraining pretrained AV-HuBERT backbones with modest overhead, setting a new standard for robust multimodal speech representations in noisy real-world settings.
Abstract
Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework designed specifically to handle joint audio-visual corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, in which the student model learns to predict clean targets, generated by the teacher model, from corrupted input frames. Specifically, we propose unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities by predicting clean audio targets from corrupted video and clean video targets from corrupted audio. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
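The corrupted prediction objective described in the abstract can be illustrated with a toy sketch. This is not the released CAV2vec implementation. The scalar per-frame features, the one-weight-per-modality "encoder", the zeroing-based corruption, the mean squared error loss, and the exponential-moving-average teacher update are all simplifying assumptions made for illustration, and every function name here is hypothetical. The sketch shows only the data flow: the teacher sees clean inputs, the student sees corrupted inputs, and the two unimodal terms use targets from the opposite modality.

```python
import random

def corrupt(frames, rate, rng):
    """Zero each frame with probability `rate` (toy stand-in for noise or occlusion)."""
    return [0.0 if rng.random() < rate else x for x in frames]

def encode(audio, video, w):
    """Toy linear 'encoder' with one weight per modality; pass None to drop a modality."""
    n = len(audio if audio is not None else video)
    a = audio if audio is not None else [0.0] * n
    v = video if video is not None else [0.0] * n
    return [w["audio"] * x + w["video"] * y for x, y in zip(a, v)]

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multitask_corrupted_loss(audio, video, student, teacher, rate, rng):
    """Sum of three prediction losses against clean teacher targets."""
    a_bad = corrupt(audio, rate, rng)
    v_bad = corrupt(video, rate, rng)
    # The teacher only ever sees clean inputs.
    target_av = encode(audio, video, teacher)
    target_audio = encode(audio, None, teacher)
    target_video = encode(None, video, teacher)
    # Audio-visual term: both inputs corrupted, clean audio-visual target.
    loss_av = mse(encode(a_bad, v_bad, student), target_av)
    # Unimodal cross-modal terms, following the abstract:
    # corrupted video predicts clean audio targets, and vice versa.
    loss_video_to_audio = mse(encode(None, v_bad, student), target_audio)
    loss_audio_to_video = mse(encode(a_bad, None, student), target_video)
    return loss_av + loss_video_to_audio + loss_audio_to_video

def ema_update(teacher, student, decay):
    """Assumed teacher update: exponential moving average of student weights."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

# Usage with made-up feature values.
rng = random.Random(0)
audio = [0.5, -1.0, 0.25, 2.0]
video = [0.4, -0.9, 0.3, 1.8]
student = {"audio": 1.0, "video": 1.0}
teacher = dict(student)
loss = multitask_corrupted_loss(audio, video, student, teacher, rate=0.5, rng=rng)
teacher = ema_update(teacher, student, decay=0.99)
```

In a real system the encoders would be Transformer backbones, the student weights would be updated by gradient descent on this loss, and masked prediction would be applied alongside the corrupted prediction terms.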
