BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic

TL;DR

BRAVEn extends the RAVEn self-supervised framework by using the average of the teacher's Transformer-block outputs as targets, making the predictors and input masking asymmetric across modalities, and weighting the audio student's within-modal loss more heavily than its cross-modal loss. These design choices yield state-of-the-art results among self-supervised methods for both visual and auditory speech recognition, with favourable scaling as the amount of unlabelled data increases. Notably, BRAVEn-Large, pre-trained on thousands of hours of unlabelled data and using only 30 hours of labelled data, achieves word error rates of 20.0% for VSR and 1.7% for ASR on the LRS3 test set, approaching the performance of methods trained on far more transcribed data. The results suggest that abundant unlabelled audio-visual data can largely substitute for costly labelled data, which could make such systems more practical in low-resource settings while remaining competitive in high-resource regimes.

Abstract

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.

Paper Structure (16 sections, 2 equations, 1 figure, 4 tables)


Figures (1)

  • Figure 1: BRAVEn overview. BRAVEn uses targets that are the average of the outputs of the Transformer encoder blocks of the teacher networks. The video student predicts the audio targets via a 1-block predictor. In contrast, the audio student uses 2-block predictors and is trained with both a cross-modal and a within-modal loss, with the within-modal loss weighted twice as heavily as the cross-modal loss. The input masking applied to audio is stronger than that applied to video.
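
The two numerical ideas in the Figure 1 caption (averaging the teacher's per-block outputs into a single target, and weighting the audio student's within-modal loss twice as heavily as its cross-modal loss) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the nested-list data layout, and the absolute weight values 1.0 and 2.0 are assumptions made for illustration (the caption only states the 2:1 ratio), and the form of the underlying loss is deliberately left unspecified.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not taken from the BRAVEn code base.

def average_block_outputs(block_outputs):
    """Average the outputs of all teacher encoder blocks.

    block_outputs: list over blocks; each block is a list over time steps;
    each time step is a list of floats (a feature vector).
    Returns one averaged feature vector per time step.
    """
    num_blocks = len(block_outputs)
    num_steps = len(block_outputs[0])
    dim = len(block_outputs[0][0])
    return [
        [sum(block[t][d] for block in block_outputs) / num_blocks
         for d in range(dim)]
        for t in range(num_steps)
    ]


def video_student_loss(cross_modal_loss):
    """Video student: only a cross-modal term (it predicts audio targets)."""
    return cross_modal_loss


def audio_student_loss(cross_modal_loss, within_modal_loss,
                       cross_weight=1.0, within_weight=2.0):
    """Audio student: cross-modal plus within-modal term.

    The 2:1 ratio follows the caption; the absolute values are assumed.
    """
    return cross_weight * cross_modal_loss + within_weight * within_modal_loss


# Toy teacher with 2 blocks, 2 time steps, 2 features per step.
teacher_blocks = [
    [[1.0, 2.0], [3.0, 4.0]],  # output of block 1
    [[3.0, 4.0], [5.0, 6.0]],  # output of block 2
]
targets = average_block_outputs(teacher_blocks)
print(targets)                         # [[2.0, 3.0], [4.0, 5.0]]
print(audio_student_loss(0.5, 0.25))   # 1.0
```

In this toy setting the scalar loss values passed to the two loss functions stand in for whatever per-modality prediction losses the students compute against the averaged targets.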