Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling

TL;DR

This work tackles robust audio-visual representation learning by transferring knowledge from large-scale speech foundation models (SFMs) to a joint audio-visual student. It introduces a cross-modal knowledge distillation framework that distills multi-layer teacher representations from SFMs into a single student, training it with a soft-label KL loss combined with a feature-regression loss and using a multi-teacher ensemble to improve generalization. The method achieves superior or comparable performance on ASR, VSR, and AVSR across clean and noisy conditions, and ablations confirm the benefits of multi-teacher ensembles, soft-label distillation, and joint audio-visual modeling. The results indicate that SFMs encode rich linguistic structure that can be transferred to the visual domain, enabling robust AV representations with reduced labeled-data requirements and flexible finetuning.

Abstract

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method.
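
To make the distillation setup more concrete, the following is a minimal NumPy sketch of how targets and losses of this kind could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean-squared-error form of the feature regression, the temperature parameter, the per-teacher prediction heads, and the weighting factor `alpha` are all hypothetical, since this summary does not specify the exact form of the loss.

```python
import numpy as np


def average_last_k_layers(layer_outputs, k):
    """Average the last k teacher layers into one target of shape (T, D).

    layer_outputs: array of shape (num_layers, T, D), assumed to be
    extracted from a frozen teacher run on clean audio.
    """
    layers = np.asarray(layer_outputs, dtype=np.float64)
    if not 1 <= k <= layers.shape[0]:
        raise ValueError("k must be between 1 and the number of layers")
    return layers[-k:].mean(axis=0)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def kl_soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over frames (assumed soft-label form)."""
    log_p = log_softmax(np.asarray(teacher_logits, dtype=np.float64) / temperature)
    log_q = log_softmax(np.asarray(student_logits, dtype=np.float64) / temperature)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


def feature_regression_loss(target, prediction):
    """Mean squared error between teacher target and student prediction
    (MSE is an assumption; the actual regression loss may differ)."""
    diff = np.asarray(target, dtype=np.float64) - np.asarray(prediction, dtype=np.float64)
    return float(np.mean(diff ** 2))


def multi_teacher_loss(student_predictions, teacher_layer_stacks, k):
    """Average the regression loss over several teachers.

    student_predictions[i] is the output of a hypothetical student head
    that predicts the representation of teacher i.
    """
    losses = [
        feature_regression_loss(average_last_k_layers(stack, k), pred)
        for pred, stack in zip(student_predictions, teacher_layer_stacks)
    ]
    return float(np.mean(losses))


def combined_loss(regression, soft_label, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical weight."""
    return alpha * regression + (1.0 - alpha) * soft_label
```

In this sketch, each teacher may have its own feature dimension because the student is assumed to use one prediction head per teacher; the per-teacher losses are simply averaged, which is only one of several plausible ways to form an ensemble.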

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of overall scheme of our proposed method during the pretraining and finetuning phases.
  • Figure 2: Illustration of our proposed method during the pretraining phase.
  • Figure 3: WER (%) results on the test sets corrupted by various types of noise. "A" and "AV" represent models using audio and audio-visual inputs respectively.
  • Figure 4: WER (%) results on our validation set as a function of $k$ (number of last teacher layers averaged) for WavLM and iFLYTEK-speech based teachers.
  • Figure 5: Distance matrices between phonemes using their corresponding representations from the student. Hierarchical clustering was applied to cluster phonemes based on their distances. "Audio" and "Video" refer to phoneme representations extracted from audio and visual inputs respectively. "Layer N" indicates representations from the N-th layer of the student model.
  • ...and 1 more figure