Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

TL;DR

The paper tackles far-field multichannel speech recognition under data scarcity by introducing AV-wav2vec2, a multichannel audio-visual self-supervised framework that uses intra- and inter-channel contrastive losses to exploit spatial cues. It integrates a multichannel AV branch with a single-channel branch and leverages additional unlabeled audio data to strengthen representations, achieving improvements on AVSR, ASR, VSR, and AVSD compared with beamforming and English-language AV models. Key contributions include the dual-contrastive pre-training losses, a practical architecture that handles up to six channels, and demonstrations on a Chinese Mandarin dataset showing robustness in noisy real-world scenarios. The approach offers a scalable avenue for improving multimodal speech processing in far-field conditions and under data-limited settings, with implications for robust audiovisual diarization and recognition in challenging environments.

Abstract

Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose AV-wav2vec2, a multichannel multi-modal speech self-supervised learning framework that takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes the multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, building on contrastive learning, we jointly train on additional single-channel audio data to improve the learned speech representations. Finally, we use a Chinese multichannel multi-modal dataset recorded in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
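
The intra- and inter-channel contrastive objectives mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption rather than the authors' implementation: it uses a wav2vec 2.0-style InfoNCE loss with cosine similarity, takes the same-channel target as the intra-channel positive and the same-time target from a different channel as the inter-channel positive, and draws distractors from other time steps. The function names, the `contexts[c][t]` / `targets[c][t]` layout, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(context, positive, distractors, temperature=0.1):
    """InfoNCE loss for one masked frame: -log softmax score of the positive."""
    logits = [cosine(context, positive) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over positive + distractor logits.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


def multichannel_contrastive(contexts, targets, t, distractor_steps,
                             temperature=0.1):
    """Return (intra, inter) losses averaged over channels for time step t.

    contexts[c][t] and targets[c][t] are vectors for channel c at time t
    (an assumed layout, not taken from the paper).
    """
    n_channels = len(contexts)
    intra, inter, n_pairs = 0.0, 0.0, 0
    for c in range(n_channels):
        # Intra-channel: the positive comes from the same channel.
        negatives = [targets[c][k] for k in distractor_steps]
        intra += info_nce(contexts[c][t], targets[c][t], negatives, temperature)
        # Inter-channel (assumed form): the positive comes from another channel.
        for other in range(n_channels):
            if other == c:
                continue
            negatives = [targets[other][k] for k in distractor_steps]
            inter += info_nce(contexts[c][t], targets[other][t],
                              negatives, temperature)
            n_pairs += 1
    return intra / n_channels, inter / max(n_pairs, 1)
```

In this sketch, a context vector that matches its target far better than the distractors yields a loss near zero, while a context that matches a distractor instead yields a large loss. How the two terms are weighted against each other, and against the single-channel branch, is not stated in the abstract and is left open here.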

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The structure of our proposed multichannel multi-modal speech self-supervised pre-training framework.
  • Figure 2: The overall model structure for the downstream AVSR, ASR, and VSR tasks.
  • Figure 3: The overall model structure for the downstream AVSD task.
  • Figure 4: Decoding examples of the proposed AV-wav2vec2 model and the end-to-end supervised model (without pre-training), where GT denotes the ground truth and red characters mark errors in the output.