Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

TL;DR

This work tackles data scarcity and the modality gap in visual acoustic matching (VAM) and acoustic dereverberation by proposing MVSD, a mutual-learning framework built on diffusion models. MVSD uses two symmetric converters, a reverberator and a dereverberator, both conditioned on visual scenes and connected in a closed loop that provides cycle-consistency feedback, enabling learning from both paired and unpaired data. The approach relies on visual-scene-driven diffusion with a controllable U-Net and cross-modal attention, and is trained end to end with a combined loss comprising diffusion, mutual-learning, and style-consistency terms. Results on SoundSpaces-Speech and Acoustic AVSpeech show state-of-the-art VAM and dereverberation metrics, and ablations support the benefits of diffusion over GANs and the value of unpaired data for robustness and generalization.

Abstract

Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, and this prevents them from using extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ diffusion models as the foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM, called the reverberator, and one for dereverberation, called the dereverberator. The dereverberator judges whether the reverberant audio generated by the reverberator sounds like it belongs in the conditioning visual scenario, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, show that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios.
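
The closed-loop idea can be illustrated with a deliberately simplified sketch. It is not the paper's method: single-echo filters stand in for the diffusion-based converters, the visual scene is reduced to one hypothetical scalar, and every function name and parameter below is an assumption made for illustration. The sketch only shows how reverberant audio with no clean counterpart can still yield a feedback signal: dereverberate it, reverberate the result for the same scene, and compare with the input.

```python
# Toy illustration only: single-echo filters stand in for the diffusion-based
# converters. All names and parameters here are hypothetical.

def reverberate(clean, scene, theta, delay=3):
    """Toy reverberator: add one echo whose gain depends on the scene scalar."""
    gain = theta * scene
    return [x + (gain * clean[n - delay] if n >= delay else 0.0)
            for n, x in enumerate(clean)]

def dereverberate(reverb, scene, phi, delay=3):
    """Toy dereverberator: recursively subtract the estimated echo."""
    gain = phi * scene
    out = []
    for n, y in enumerate(reverb):
        echo = gain * out[n - delay] if n >= delay else 0.0
        out.append(y - echo)
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cycle_loss_unpaired(reverb, scene, theta, phi):
    """Feedback from one-way data: reverberant audio plus scene, no clean target."""
    pseudo_clean = dereverberate(reverb, scene, phi)         # pseudo-input for the reverse task
    reconstructed = reverberate(pseudo_clean, scene, theta)  # close the loop
    return mse(reconstructed, reverb)

scene = 0.8  # stand-in for visual-scene conditioning
unpaired_reverb = [0.0, 1.0, 0.5, -0.25, 0.75, 0.1, -0.6, 0.3]

# Converters that agree with each other reconstruct the input almost exactly;
# converters that disagree produce a non-zero loss that could drive training.
loss_matched = cycle_loss_unpaired(unpaired_reverb, scene, theta=0.5, phi=0.5)
loss_mismatched = cycle_loss_unpaired(unpaired_reverb, scene, theta=0.5, phi=0.1)
print(loss_matched, loss_mismatched)
```

In this toy setting the loss vanishes (up to floating-point error) only when the two converters are mutually consistent for the given scene, which is the intuition behind using the loop as a training signal when paired data is unavailable.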

Paper Structure

This paper contains 16 sections, 10 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: There is an inherent reciprocity between VAM and dereverberation. Unlike previous approaches that treat these two tasks independently, our framework handles both tasks simultaneously. Forming a closed loop between the two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-sided unpaired data (§\ref{sec:intro}).
  • Figure 2: Overview of MVSD. The output of a converter can serve as pseudo-input for the reverse task, providing an intermediate transition. Concretely, the reverberator $f_\theta$ and dereverberator $g_\phi$ can generate feedback signals $\mathcal{L}_{m}$ (Eq. \ref{eq:6}) for mutual optimization during training, even with one-way unpaired data ($\bm{a}'_r, \bm{v}'$) (§\ref{sec:MVD}).
  • Figure 3: The diffusion and denoising processes of VSD. Taking VAM as an example, MVSD converts anechoic audio $\bm{a}_c$ into reverberant audio $\hat{\bm{a}}_r$ that aligns with the acoustics of the visual scene $\bm{v}$ (§\ref{sec:DD}).
  • Figure 4: Quantitative dereverberation results on SoundSpaces-Speech \cite{chen22vam} (§\ref{PD}).
  • Figure 5: User study results. X%/Y% means that X% of participants prefer this method while Y% prefer MVSD (§\ref{exp:US}).
  • ...and 7 more figures