AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
TL;DR
AV-data2vec presents a fully end-to-end, self-supervised framework for learning joint audio-visual speech representations using a single shared transformer encoder and a teacher-student EMA setup to predict contextualized targets. By combining audio and visual streams through early fusion and a modality scheduler, it trains on nine training tasks and uses targets derived from multiple transformer layers, enabling robust AVSR with limited labeled data. Across low- and high-resource regimes on LRS3 (and VoxCeleb2), AV-data2vec achieves state-of-the-art results in VSR, ASR, and AVSR under equal data and model size, and ablations show benefits from multi-block target averaging and larger batch sizes with more data. The approach narrows the gap to supervised systems while highlighting practical considerations such as hyperparameter sensitivity and the value of stronger visual representations, suggesting directions like improved visual encoders and articulatory signals for further gains.
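The teacher-student EMA setup and the multi-block target averaging mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ema_update` and `average_top_blocks`, the flat-list representation of weights and block outputs, and the decay value are all hypothetical, and any per-block normalization a real system might apply before averaging is omitted.

```python
from typing import List

Vector = List[float]


def ema_update(teacher: Vector, student: Vector, tau: float) -> Vector:
    """One EMA step for the teacher: tau * teacher + (1 - tau) * student.

    `tau` is an assumed decay hyperparameter, not a value from the paper.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_blocks(block_outputs: List[Vector], k: int) -> Vector:
    """Average the outputs of the last k transformer blocks element-wise.

    `block_outputs[i]` stands in for the teacher's output at block i for one
    time step; the averaged vector is the regression target for the student.
    """
    if not 1 <= k <= len(block_outputs):
        raise ValueError("k must be between 1 and the number of blocks")
    top = block_outputs[-k:]
    dim = len(top[0])
    return [sum(block[i] for block in top) / k for i in range(dim)]


# Illustrative values only.
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.5)
target = average_top_blocks([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]], k=2)
```

In this sketch the teacher is never trained by gradient descent; it only tracks the student through `ema_update`, while the student is trained to match `target` at masked positions.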
Abstract
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
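As a rough illustration of how a single shared encoder might receive audio, video, or both, the sketch below fuses per-frame features before the encoder. It rests on assumptions not stated in this summary: fusion is done by concatenating the two feature vectors of each frame, a dropped modality is replaced by zeros, and the helper name `fuse_frames` and its flags are hypothetical.

```python
from typing import List

Frame = List[float]


def fuse_frames(
    audio: List[Frame],
    video: List[Frame],
    use_audio: bool = True,
    use_video: bool = True,
) -> List[Frame]:
    """Concatenate time-aligned audio and video features frame by frame.

    A disabled modality is zero-filled so the fused dimension stays constant
    and the same encoder can handle audio-only, video-only, or joint input.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of frames")
    if not (use_audio or use_video):
        raise ValueError("at least one modality must be enabled")
    fused = []
    for a, v in zip(audio, video):
        a_part = list(a) if use_audio else [0.0] * len(a)
        v_part = list(v) if use_video else [0.0] * len(v)
        fused.append(a_part + v_part)
    return fused


# Illustrative values only: one frame, 2-dim audio and 1-dim video features.
joint = fuse_frames([[1.0, 2.0]], [[3.0]])
video_only = fuse_frames([[1.0, 2.0]], [[3.0]], use_audio=False)
```

Keeping the fused dimension fixed is one simple way to let the same encoder weights serve VSR, ASR, and AVSR inputs; other fusion schemes are possible.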
