Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Jiarong Du, Zhan Jin, Peijun Yang, Juan Liu, Zhuo Li, Xin Liu, Ming Li

TL;DR

This work tackles audio-visual speech enhancement in noisy, reverberant real-world environments with a separation-before-dereverberation framework that leverages multimodal cues. The method builds on TF-GridNet, fusing lip-reading and facial-expression features into the separation network and training it progressively with intermediate targets. A post-processing dereverberation module (initially diffusion-based, later replaced by SkipConvNet to allow joint training) enhances the separated output, and a joint loss guides end-to-end optimization. Experiments on the AVSEC-4 dataset show strong subjective intelligibility and competitive objective metrics, indicating robustness in complex acoustic scenarios, and the system placed first in the competition's human subjective listening test.
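The two-stage structure and joint loss summarized above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `separate` and `dereverberate` are hypothetical identity placeholders for the two networks, and the negative SI-SDR loss, the weight `alpha`, and the use of a dry (non-reverberant) target for the second stage are assumptions chosen for illustration. The source does not state the actual loss terms or weights.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    projection = scale * reference
    noise = estimate - projection
    return 10.0 * np.log10(
        (np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps)
    )


def separate(mixture, visual):
    # Hypothetical placeholder for the audio-visual separation network.
    # A real model would use `visual`; here the mixture is passed through.
    return mixture


def dereverberate(separated):
    # Hypothetical placeholder for the dereverberation network.
    return separated


def joint_loss(mixture, visual, reverberant_target, dry_target, alpha=0.5):
    """Assumed joint objective: weighted sum of the two stage losses."""
    separated = separate(mixture, visual)   # stage 1: remove interference
    enhanced = dereverberate(separated)     # stage 2: remove reverberation
    loss_sep = -si_sdr(separated, reverberant_target)
    loss_derev = -si_sdr(enhanced, dry_target)
    return alpha * loss_sep + (1.0 - alpha) * loss_derev
```

Because the loss depends on both stage outputs, gradients from the dereverberation term would also reach the separation stage when both are trainable networks, which is the usual motivation for optimizing such a pipeline jointly.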

Abstract

Audio-visual speech enhancement (AVSE) uses auxiliary visual information to extract a target speaker's speech from mixed audio. Real-world acoustic environments are often complex, with various interfering sounds and reverberation. Most previous methods struggle under such conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated our system in AVSEC-4: it achieved excellent results on the three objective metrics of the competition leaderboard and ultimately secured first place in the human subjective listening test.

Paper Structure

This paper contains 10 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overall Model Architecture. Lip movement and facial expression features are extracted by a pre-trained lip-reading model [martinez2020lipreading] and an expression estimation model [Roy2024ResEmoteNetBA], respectively. These three types of features are concatenated along the channel dimension and then fed into the separation module in the same way as in [jin2025Robust]. The separation module adopts the same structure as that of [wang2023tf].
  • Figure 2: Progressive Training. In this work, $K$ is set to 6. $\text{E}_k$ is the intermediate estimate output by the $k$-th layer of the separation module, and $\text{Mixture}_k$ is the intermediate target at the $k$-th layer. Target refers to the target speaker's speech, free of interference but still reverberant.
  • Figure 3: Results of subjective evaluation by human auditory perception. Our system is represented by "U".
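
The captions for Figures 1 and 2 describe channel-wise feature fusion and layer-wise intermediate targets; a minimal NumPy sketch of both ideas is given below. The tensor shapes, the helper names `fuse_features`, `intermediate_targets`, and `progressive_loss`, the choice of audio as the third feature stream, the mean-squared-error loss, and the linear interpolation from mixture to target are all assumptions for illustration. The source only states that the features are concatenated along the channel dimension and that K is set to 6.

```python
import numpy as np


def fuse_features(audio, lip, face):
    """Concatenate three (channels, frames) feature maps along the channel axis."""
    frame_counts = {audio.shape[1], lip.shape[1], face.shape[1]}
    if len(frame_counts) != 1:
        raise ValueError("all streams must share the same number of frames")
    return np.concatenate([audio, lip, face], axis=0)


def intermediate_targets(mixture, target, num_layers=6):
    """Assumed schedule: the target for layer k moves linearly from the
    input mixture toward the final target, reaching it at the last layer."""
    targets = []
    for k in range(1, num_layers + 1):
        weight = k / num_layers
        targets.append((1.0 - weight) * mixture + weight * target)
    return targets


def progressive_loss(estimates, targets):
    """Average the per-layer mean squared error over all layers."""
    per_layer = [np.mean((e - t) ** 2) for e, t in zip(estimates, targets)]
    return float(np.mean(per_layer))


# Usage with hypothetical feature sizes (4 audio, 8 lip, 2 face channels).
rng = np.random.default_rng(0)
fused = fuse_features(
    rng.normal(size=(4, 50)), rng.normal(size=(8, 50)), rng.normal(size=(2, 50))
)
print(fused.shape)  # (14, 50)
```

Supervising each layer with its own target gives earlier layers an easier objective than the final one, which is the general idea behind progressive training; the actual construction of $\text{Mixture}_k$ in the paper may differ from the interpolation assumed here.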