AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

TL;DR

AV-data2vec presents a fully end-to-end, self-supervised framework for learning joint audio-visual speech representations using a single shared transformer encoder and a teacher-student EMA setup to predict contextualized targets. By combining audio and visual streams through early fusion and a modality scheduler, it trains on nine training tasks and uses targets derived from multiple transformer layers, enabling robust AVSR with limited labeled data. Across low- and high-resource regimes on LRS3 (and VoxCeleb2), AV-data2vec achieves state-of-the-art results in VSR, ASR, and AVSR under equal data and model size, and ablations show benefits from multi-block target averaging and larger batch sizes with more data. The approach narrows the gap to supervised systems while highlighting practical considerations such as hyperparameter sensitivity and the value of stronger visual representations, suggesting directions like improved visual encoders and articulatory signals for further gains.

Abstract

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.

Paper Structure (21 sections, 3 equations, 3 figures, 4 tables)

This paper contains 21 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: AV-data2vec jointly encodes both audio and visual data to build audio-visual representations. The student model encodes a masked version of both audio and visual data and predicts a contextualized target representation created by a teacher model which is based on the unmasked version of the training sample. Target representations encode both high-level and low-level features from multiple layers of the teacher model.
  • Figure 2: Effect of averaging $K$ blocks to create contextualized target representations. More blocks improve performance because targets become richer due to including both high-level and low-level features. Results are based on a Base model pretrained on 433h of unlabeled data and finetuned on 30h of labeled data.
  • Figure 3: AV-data2vec performs better than audio-only training (A-data2vec) in all ASR settings.
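
The teacher-student mechanism described in the Figure 1 and Figure 2 captions can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the decay value `tau`, the use of plain nested lists in place of tensors, and the choice of mean squared error restricted to masked time steps are all assumptions made for illustration; the summary above only states that the teacher is an EMA of the student, that targets are built from several teacher blocks, and that the student predicts them from masked input.

```python
from typing import List, Sequence

Frames = List[List[float]]  # one feature vector per time step


def ema_update(teacher: Sequence[float], student: Sequence[float], tau: float) -> List[float]:
    """Move teacher weights toward the student: tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_k_blocks(block_outputs: Sequence[Frames], k: int) -> Frames:
    """Build targets by averaging the outputs of the last k teacher blocks."""
    if not 1 <= k <= len(block_outputs):
        raise ValueError("k must be between 1 and the number of blocks")
    top = block_outputs[-k:]
    num_steps, dim = len(top[0]), len(top[0][0])
    return [
        [sum(block[t][d] for block in top) / k for d in range(dim)]
        for t in range(num_steps)
    ]


def masked_mse(pred: Frames, target: Frames, masked: Sequence[bool]) -> float:
    """Mean squared error over masked time steps only (loss choice is an assumption)."""
    errors = [
        (p - t) ** 2
        for p_row, t_row, m in zip(pred, target, masked) if m
        for p, t in zip(p_row, t_row)
    ]
    if not errors:
        raise ValueError("at least one time step must be masked")
    return sum(errors) / len(errors)


# Toy example: three teacher blocks, two time steps, two features each.
teacher_blocks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[2.0, 4.0], [3.0, 3.0]],
    [[4.0, 8.0], [7.0, 7.0]],
]
targets = average_top_k_blocks(teacher_blocks, k=2)   # averages the last two blocks
student_pred = [[1.0, 6.0], [0.0, 0.0]]               # hypothetical student outputs
loss = masked_mse(student_pred, targets, masked=[True, False])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9)
```

In this toy setup the target for each time step mixes features from a lower and a higher block, which is the effect the Figure 2 caption attributes to larger $K$. The teacher weights are never trained directly; they only follow the student through the EMA step.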