Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

TL;DR

Experimental results on the VoxCeleb2 dataset show that the proposed model outperforms the baselines in terms of both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction.

Abstract

Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the extraction process. AV-HuBERT is a pre-trained model that has proven useful for lip-reading but has not yet been adopted for AV-TSE. In this paper, we explore how to integrate a pre-trained AV-HuBERT into our AV-TSE system, and we have good reasons to expect improved performance. To benefit from inter- and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. Experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines in terms of both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the proposed Mask-And-Recover strategy is significantly effective.

Paper Structure (19 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The overall architecture of the proposed AVHuMAR-TSE system. System (a) on the left is the basic AVHuMAR-TSE system without the MAR block, which is called the AVHuBERT-TSE system. System (b) on the right is the complete AVHuMAR-TSE system.
  • Figure 2: The Speaker Extractor is repeated $R$ times. The $r$-th Speaker Extractor mainly contains a Cue Encoder, which refines the target speaker's visual cue $V^r_{(t)}$, and a Mask Estimator, which predicts the target speech mask $M^r_{(t)}$. The Speech Decoder and Speech Encoder are used to reconstruct and encode the intermediate estimated speech $\hat{S}^r_{(t)}$, respectively. Note that the initial target speech mask $M^0_{(t)}$ is predicted conditioned on the initial visual cue $V^0_{(t)}$ and the mixture speech embedding $X^0_{(t)}$.
  • Figure 3: Comparison of target speech spectrograms extracted by the AVHuMAR-TSE system and the MuSE system.
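
The repeated Speaker Extractor described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: every layer is replaced by a single random matrix, the names `extract_target_speech`, `params`, `w_mask`, `w_cue`, `w_enc`, and `w_dec` are hypothetical, and two details are assumed rather than taken from the caption, namely that the mask is applied to the mixture embedding by element-wise multiplication and that the Cue Encoder refines the visual cue using the re-encoded intermediate estimate.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def extract_target_speech(mixture_emb, visual_cue, params, num_repeats):
    """Toy loop over R Speaker Extractor repeats (all weights are stand-ins).

    mixture_emb: (T, D) mixture speech embedding X^0
    visual_cue:  (T, D) initial visual cue V^0
    params:      dict of hypothetical weight matrices replacing learned layers
    """
    # Initial mask M^0, conditioned on V^0 and X^0 (as stated in the caption).
    joint = np.concatenate([visual_cue, mixture_emb], axis=1)
    mask = sigmoid(joint @ params["w_mask"])
    est_emb = mask * mixture_emb  # assumption: element-wise masking

    cue = visual_cue
    for _ in range(num_repeats):
        # Speech Decoder, then Speech Encoder: reconstruct the intermediate
        # estimate and encode it again.
        est_wave = est_emb @ params["w_dec"]      # (T, L)
        est_emb_re = est_wave @ params["w_enc"]   # (T, D)

        # Cue Encoder: refine the visual cue (assumed to use the re-encoded
        # estimate as additional input).
        cue_in = np.concatenate([cue, est_emb_re], axis=1)
        cue = np.tanh(cue_in @ params["w_cue"])

        # Mask Estimator: predict the next mask from the refined cue and
        # the mixture embedding.
        joint = np.concatenate([cue, mixture_emb], axis=1)
        mask = sigmoid(joint @ params["w_mask"])
        est_emb = mask * mixture_emb

    return est_emb @ params["w_dec"], mask


# Usage with arbitrary sizes: T frames, D embedding dims, L samples per frame.
T, D, L = 5, 4, 3
rng = np.random.default_rng(0)
params = {
    "w_mask": rng.normal(size=(2 * D, D)),
    "w_cue": rng.normal(size=(2 * D, D)),
    "w_enc": rng.normal(size=(L, D)),
    "w_dec": rng.normal(size=(D, L)),
}
speech, mask = extract_target_speech(
    rng.normal(size=(T, D)), rng.normal(size=(T, D)), params, num_repeats=3
)
print(speech.shape, mask.shape)
```

Sharing one set of weights across all repeats keeps the sketch short. Whether the actual system shares or separates parameters across the $R$ repeats is not stated in the caption.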