DBMIF: a deep balanced multimodal iterative fusion framework for air- and bone-conduction speech enhancement

Yilei Wu, Changyan Zheng, Xingyu Zhang, Yakun Zhang, Chengshi Zheng, Shuang Yang, Ye Yan, Erwei Yin

TL;DR

The proposed Deep Balanced Multimodal Iterative Fusion Framework effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios.

Abstract

The performance of conventional speech enhancement systems degrades sharply in extremely low signal-to-noise ratio (SNR) environments where air-conduction (AC) microphones are overwhelmed by ambient noise. Although bone-conduction (BC) sensors offer complementary, noise-tolerant information, existing fusion approaches struggle to maintain consistent performance across a wide range of SNR conditions. To address this limitation, we propose the Deep Balanced Multimodal Iterative Fusion Framework (DBMIF), a three-branch architecture designed to reconstruct high-fidelity speech through rigorous cross-modal interaction. Specifically, grounded in a multi-scale interactive encoder-decoder backbone, the framework orchestrates an iterative attention module and a cross-branch gated module to facilitate adaptive weighting and bidirectional exchange. To complement this dynamic interaction, a balanced-interaction bottleneck is further integrated to learn a compact, stable fused representation. Extensive experiments demonstrate that DBMIF achieves competitive performance compared with recent unimodal and multimodal baselines in both speech quality and intelligibility across diverse noise types. In downstream automatic speech recognition (ASR) tasks, the proposed method reduces the character error rate by at least 2.5 percent compared to competing approaches. These results confirm that DBMIF effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios. The source code is publicly available at github.com/wyl516w/dbmif.

Paper Structure

This paper contains 33 sections, 19 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Overall architecture of DBMIF.
  • Figure 2: Diagram of the CBGI module. Solid arrows denote the feature flow, while dashed arrows indicate the gating signals generated to modulate the corresponding streams.
  • Figure 3: Diagram of the early DIAF module. DIAF progressively refines AC and BC features through iterative channel attention to emphasize reliable modality cues.
  • Figure 4: Diagram of the DBI module. DBI maintains recurrent modality states and iteratively updates them through intra-modal refinement and fusion feedback.
  • Figure 5: Low-SNR results among different methods.
  • ...and 5 more figures