Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
TL;DR
The paper tackles far-field multichannel speech recognition under data scarcity by introducing AV-wav2vec2, a multichannel audio-visual self-supervised framework that uses intra- and inter-channel contrastive losses to exploit spatial cues. It integrates a multichannel AV branch with a single-channel branch and leverages additional unlabeled audio data to strengthen representations, achieving improvements on AVSR, ASR, VSR, and AVSD compared with beamforming and English-language AV models. Key contributions include the dual-contrastive pre-training losses, a practical architecture that handles up to six channels, and demonstrations on a Chinese Mandarin dataset showing robustness in noisy real-world scenarios. The approach offers a scalable avenue for improving multimodal speech processing in far-field conditions and under data-limited settings, with implications for robust audiovisual diarization and recognition in challenging environments.
Abstract
Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise, and the efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework, AV-wav2vec2, which takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes the multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, building on contrastive learning, we use additional single-channel audio data for joint training to improve the learned speech representations. Finally, we use a real-scenario Chinese multichannel multi-modal dataset to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
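To make the idea of intra- and inter-channel contrastive targets more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a wav2vec 2.0 style InfoNCE objective with cosine similarity and a temperature, where each channel yields per-frame context vectors and per-frame target vectors. The function names (`info_nce`, `multichannel_losses`), the choice of all other frames as negatives, and the simple averaging over channel pairs are assumptions made for illustration; the abstract does not specify masking, quantization, or loss weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(context, targets, temperature=0.1):
    """Mean InfoNCE loss over frames.

    context[t] is pulled toward targets[t]; every other frame in
    `targets` serves as a negative (an assumption of this sketch).
    """
    num_frames = len(context)
    total = 0.0
    for t in range(num_frames):
        logits = [cosine(context[t], targets[s]) / temperature
                  for s in range(num_frames)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[t]
    return total / num_frames

def multichannel_losses(contexts, targets, temperature=0.1):
    """Return (intra, inter) losses for per-channel frame sequences.

    intra: context of channel c against targets of the same channel c.
    inter: context of channel c against targets of each other channel k.
    Both are plain averages here (hypothetical weighting).
    """
    num_channels = len(contexts)
    intra = sum(info_nce(contexts[c], targets[c], temperature)
                for c in range(num_channels)) / num_channels
    pairs = [(c, k) for c in range(num_channels)
             for k in range(num_channels) if k != c]
    if not pairs:  # single-channel input has no inter-channel term
        return intra, 0.0
    inter = sum(info_nce(contexts[c], targets[k], temperature)
                for c, k in pairs) / len(pairs)
    return intra, inter
```

Under these assumptions, the inter-channel term rewards representations whose frame at time t in one channel identifies the frame at the same time in another channel, which is one plausible way for spatial cues shared across microphones to enter the training signal.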
