DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers

Tianyu Cao; Helin Wang; Ari Frummer; Yuval Sieradzki; Adi Arbel; Laureano Moro Velazquez; Jesus Villalba; Oren Gal; Thomas Thebaud; Najim Dehak

Abstract

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on a latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from a variational autoencoder (VAE). We validate our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating low-rank adaptation (LoRA) with the mixture-of-experts (MoE) framework, we achieve parameter-efficient, high-performance training of a DiT-Flow model that is robust to multiple distortions: using only 4.9% of the total parameters, it obtains better performance on five unseen distortions.
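To make the flow matching idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a straight-line (rectified-flow style) probability path between a source latent and a clean target latent, which may differ from the path used in the paper. All function names are hypothetical, and the "model" is an oracle that returns the true velocity, standing in for the DiT backbone.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from source x0 to target x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, d x_t / dt = x1 - x0 (assumed linear path)."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy example: a 3-dimensional "latent" (hypothetical values).
rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0]
start = [rng.gauss(0.0, 1.0) for _ in clean_latent]


def oracle(x, t):
    # Stand-in for a trained network: returns the exact path velocity.
    return target_velocity(start, clean_latent)


loss = flow_matching_loss(oracle, start, clean_latent, t=0.3)
enhanced = euler_sample(oracle, start, num_steps=10)
```

With the oracle velocity the loss is zero and Euler integration lands on the clean latent. In a real system, a learned network conditioned on the degraded speech would replace `oracle`, and a VAE decoder would map the resulting latent back to a waveform.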

Paper Structure (31 sections, 11 equations, 5 figures, 7 tables)

Introduction
Background
Flow Matching for Generative Modeling
Mixture-of-LoRA Experts
Low-Rank Adapters.
Mixture-of-Experts.
Mixture of LoRA Experts
The Still-SonicSet Dataset
DiT-Flow Speech Enhancement
Overall pipeline
Audio compressor
Flow Matching Module
Extension of Mixture of LoRA Experts for domain adaptation
Experimental Setup
Dataset
...and 16 more sections

Figures (5)

Figure 1: The procedure used to generate StillSonicSet. In each scene, the three moving RIRs for each speaker in the original SonicSet were discretized to obtain RIRs at fixed positions (circles).
Figure 2: The audio compressor architecture with encoder (orange) and decoder (blue) in detail.
Figure 3: The model backbones of the target extractor with Diffusion Transformer backbone (yellow) and uDiT block (grey).
Figure 4: Diagram of the Mixture of LoRA Experts in the DiT-Flow model for data adaptation. (a) The basic Mixture of LoRA Experts mechanism. (b) In a UDiT block, the standard adaptation path in both the MHSA and MLP sublayers is replaced with Mixture of LoRA Experts modules (blue arrows), while the original normalization and modulation structure is kept. (c) The MLP modification. (d) The MHSA modification.
Figure 5: The extension of the MoELoRA module in which only a single expert is trained.
