BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic
TL;DR
BRAVEn extends the RAVEn self-supervised framework by averaging Transformer-block outputs, using asymmetric predictors and masking, and applying uneven loss weights to better align with audio-dominated speech tasks. These design choices yield state-of-the-art results among self-supervised methods for both visual and auditory speech recognition (VSR and ASR), with favourable scaling as the amount of unlabelled data increases. Notably, BRAVEn-Large, trained on thousands of hours of unlabelled data and only 30 hours of labelled data, achieves word error rates (WER) of 20.0% for VSR and 1.7% for ASR on the LRS3 test set, approaching the performance of methods trained on far more transcribed data. The work suggests that abundant unlabelled audio-visual data can largely substitute for costly labelled data, making the approach practical in low-resource settings while remaining competitive in high-resource regimes.
Abstract
Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
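For readers unfamiliar with the metric, the word error rates quoted above follow the standard definition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code used in this work; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative helper (hypothetical name); tokenisation is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.3f}")
```

In this toy example the hypothesis has one substitution ("sat" to "sit") and one deletion ("the"), giving 2 errors over 6 reference words. Corpus-level figures such as those reported on LRS3 are conventionally computed by summing errors and reference words over all test utterances before dividing, rather than by averaging per-utterance rates.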
