DBMIF: a deep balanced multimodal iterative fusion framework for air- and bone-conduction speech enhancement

Yilei Wu; Changyan Zheng; Xingyu Zhang; Yakun Zhang; Chengshi Zheng; Shuang Yang; Ye Yan; Erwei Yin

TL;DR

The proposed Deep Balanced Multimodal Iterative Fusion Framework effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios.

Abstract

The performance of conventional speech enhancement systems degrades sharply in extremely low signal-to-noise ratio (SNR) environments where air-conduction (AC) microphones are overwhelmed by ambient noise. Although bone-conduction (BC) sensors offer complementary, noise-tolerant information, existing fusion approaches struggle to maintain consistent performance across a wide range of SNR conditions. To address this limitation, we propose the Deep Balanced Multimodal Iterative Fusion Framework (DBMIF), a three-branch architecture designed to reconstruct high-fidelity speech through rigorous cross-modal interaction. Specifically, grounded in a multi-scale interactive encoder-decoder backbone, the framework orchestrates an iterative attention module and a cross-branch gated module to facilitate adaptive weighting and bidirectional exchange. To complement this dynamic interaction, a balanced-interaction bottleneck is further integrated to learn a compact, stable fused representation. Extensive experiments demonstrate that DBMIF achieves competitive performance compared with recent unimodal and multimodal baselines in both speech quality and intelligibility across diverse noise types. In downstream ASR tasks, the proposed method reduces the character error rate by at least 2.5 percent compared to competing approaches. These results confirm that DBMIF effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios. The source code is publicly available at github.com/wyl516w/dbmif.
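The "adaptive weighting" between the AC and BC branches can be pictured with a generic gated fusion step. The sketch below is only an illustration of that general idea under simplifying assumptions: it is not the DBMIF module, and the function name `gated_fusion`, the per-channel gate parameters `w_ac`, `w_bc`, `bias`, and the toy feature values are all hypothetical. A sigmoid gate computed from both inputs decides, channel by channel, how much of the AC feature versus the BC feature reaches the fused output.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(ac, bc, w_ac, w_bc, bias):
    """Blend per-channel AC and BC features with a sigmoid gate.

    ac, bc           : lists of floats, one value per feature channel
    w_ac, w_bc, bias : hypothetical per-channel gate parameters
                       (in a trained model these would be learned)
    Returns (fused, gates); each fused value is a convex combination
    of the corresponding AC and BC values.
    """
    fused, gates = [], []
    for a, b, wa, wb, c in zip(ac, bc, w_ac, w_bc, bias):
        g = sigmoid(wa * a + wb * b + c)     # gate in (0, 1)
        gates.append(g)
        fused.append(g * a + (1.0 - g) * b)  # g -> 1 favours AC, g -> 0 favours BC
    return fused, gates


# Toy features for three channels (made-up numbers, not from the paper).
ac_feat = [0.9, -0.2, 0.4]
bc_feat = [0.1, 0.3, 0.5]
zeros = [0.0, 0.0, 0.0]

# Zero gate parameters give a gate of 0.5, i.e. a plain average.
fused_equal, gates_equal = gated_fusion(ac_feat, bc_feat, zeros, zeros, zeros)

# A strongly negative bias closes the gate, so the output follows BC,
# mimicking a fallback to the noise-tolerant sensor at very low SNR.
fused_bc, gates_bc = gated_fusion(ac_feat, bc_feat, zeros, zeros, [-50.0] * 3)
```

Because the output is a convex combination, each fused value stays between the corresponding AC and BC values; a learned gate only shifts the balance between the two modalities.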

Paper Structure (33 sections, 19 equations, 10 figures, 7 tables)

Introduction
Related Work
AC Speech Enhancement
BC Speech Enhancement
Multimodal Feature Fusion
Method
Problem Formulation
Model Overview
Generator
Three-Branch Multi-Scale Interactive Encoder-Decoder
Early Deep Iterative Attention Fusion
Deep Balanced Interaction
Discriminator Architecture
Training Objectives
Discriminator Loss
...and 18 more sections

Figures (10)

Figure 1: Overall architecture of DBMIF.
Figure 2: Diagram of the CBGI module. Solid arrows denote the feature flow, while dashed arrows indicate the gating signals generated to modulate the corresponding streams.
Figure 3: Diagram of the early DIAF module. DIAF progressively refines AC and BC features through iterative channel attention to emphasize reliable modality cues.
Figure 4: Diagram of DBI module. DBI maintains recurrent modality states and iteratively updates them through intra-modal refinement and fusion feedback.
Figure 5: Low-SNR results among different methods.
...and 5 more figures
