Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models
Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling
TL;DR
This work tackles robust audio-visual representation learning by transferring knowledge from large-scale speech foundation models (SFMs) to a joint audio-visual student. It introduces a cross-modal knowledge distillation framework that distills multi-layer teacher representations from SFMs into a single student, trains that student with a soft-label KL loss combined with a feature-regression loss, and uses a multi-teacher ensemble to improve generalization. The method achieves superior or comparable performance on ASR, VSR, and AVSR across clean and noisy conditions, with ablations confirming the benefits of multi-teacher ensembles, soft-label distillation, and joint audio-visual modeling. The results indicate that SFMs encode rich linguistic structure that can be transferred to the visual domain, enabling robust audio-visual representations with reduced labeled-data requirements and flexible finetuning.
Abstract
Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as input. A novel representational knowledge distillation loss is employed to train the student during pretraining, and the same loss is applied during finetuning to further enhance performance on downstream tasks. Our experiments use both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrate that the proposed method achieves performance superior or at least comparable to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and visualizations of the learned representations are provided to evaluate the effectiveness of the proposed method.
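The abstract does not give the exact form of the distillation loss, so the following is only a minimal NumPy sketch of how a multi-layer, multi-teacher objective of the kind summarized in the TL;DR might be assembled. It combines a feature-regression term (mean squared error against each teacher layer) with a soft-label KL term, then averages over teachers. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the weighting factor `alpha`, the `temperature`, the use of MSE for regression, the uniform averaging over layers and teachers, and the idea that each teacher supplies frame-level logits.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def feature_regression_loss(student_feat, teacher_feat):
    """Mean squared error between two (frames, dim) feature matrices."""
    return float(np.mean((student_feat - teacher_feat) ** 2))


def soft_label_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions, averaged over frames."""
    eps = 1e-12  # guards the logarithms against zero probabilities
    p = softmax(teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    kl_per_frame = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl_per_frame))


def multi_teacher_loss(student_outputs, teacher_outputs, alpha=0.5):
    """Average the combined loss over all teachers.

    Each list entry is a dict with:
      "layers": list of (frames, dim) arrays, one per distilled layer
      "logits": (frames, classes) array of frame-level scores
    `alpha` (hypothetical) balances regression against the KL term.
    """
    per_teacher = []
    for s, t in zip(student_outputs, teacher_outputs):
        # Regression term: uniform average over the distilled layers.
        reg = np.mean([
            feature_regression_loss(s_layer, t_layer)
            for s_layer, t_layer in zip(s["layers"], t["layers"])
        ])
        kl = soft_label_kl_loss(s["logits"], t["logits"])
        per_teacher.append(alpha * reg + (1.0 - alpha) * kl)
    return float(np.mean(per_teacher))


def make_outputs(rng, frames=4, dim=8, classes=5, num_layers=3):
    """Random stand-in for one model's multi-layer features and logits."""
    return {
        "layers": [rng.normal(size=(frames, dim)) for _ in range(num_layers)],
        "logits": rng.normal(size=(frames, classes)),
    }


rng = np.random.default_rng(0)
# Two synthetic teachers standing in for, e.g., a self-supervised and a supervised SFM.
teachers = [make_outputs(rng), make_outputs(rng)]

# A student that reproduces the teachers exactly incurs zero loss.
matched = [
    {"layers": [layer.copy() for layer in t["layers"]], "logits": t["logits"].copy()}
    for t in teachers
]
# A student whose features are shifted and whose logits are permuted is penalized.
perturbed = [
    {"layers": [layer + 0.1 for layer in t["layers"]],
     "logits": t["logits"][:, ::-1].copy()}
    for t in teachers
]

print("matched loss:", multi_teacher_loss(matched, teachers))
print("perturbed loss:", multi_teacher_loss(perturbed, teachers))
```

In a real system the student features would come from an audio-visual encoder, typically passed through per-teacher, per-layer prediction heads, and the teacher features from frozen SFMs fed clean audio. The random arrays here only exercise the arithmetic of the loss.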
