VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Jaemin Jung; Junseok Ahn; Chaeyoung Jung; Tan Dat Nguyen; Youngjoon Jang; Joon Son Chung

TL;DR

VoiceDiT tackles environment-aware speech synthesis by learning to generate speech that matches textual prompts while reflecting acoustic environments, including noisy or reverberant spaces. It introduces a three-component pipeline: synthetic data pre-training, a Dual-Condition Diffusion Transformer with cross-attention for environment conditioning, and an image-to-audio translator that maps visual prompts to audio embeddings. The model uses a TTS module for text alignment and a Latent Mapper to keep computations practical, optimized with the diffusion loss $L_{diff}$ and related TTS losses; it is fine-tuned on real-world AudioSet-speech. Experiments show VoiceDiT outperforms prior methods like VoiceLDM on real data, achieving better MOS, intelligibility, and cross-modal alignment, and can generate audio from both text and images. This work enables more realistic, context-aware audio for media workflows.

Abstract

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.

Paper Structure (14 sections, 2 equations, 1 figure, 3 tables)

Introduction
Method
Data Preparation
Model Architecture
I2A-Translator
Training
Inference
Experiments
Experimental Setup
Results
Comparison with State-of-the-arts
Ablation Study
X-to-Audio Capabilities
Conclusion

Figures (1)

Figure 1: Model architecture of VoiceDiT. VoiceDiT consists of a TTS module and a Dual-DiT model. A cross-attention module is integrated into each DiT block to inject environmental conditions. "D.P" stands for Duration Predictor.
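
To make the caption's "cross-attention module ... to inject environmental conditions" concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes single-head attention with no learned projections, and it omits the self-attention, feed-forward, and normalization layers that a real DiT block would contain. The names `cross_attention` and `dual_condition_block` are hypothetical and chosen only for illustration. Queries come from the speech latent tokens, while keys and values come from the environment-condition tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, env_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    latent_tokens: list of query vectors (speech latents).
    env_tokens:    list of key/value vectors (environment condition).
    Assumption: no learned Q/K/V projections, so all vectors share one dimension.
    """
    dim = len(env_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in latent_tokens:
        # Similarity of this latent token to every environment token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in env_tokens]
        weights = softmax(scores)
        # Weighted average of the environment tokens.
        mixed = [sum(w * v[d] for w, v in zip(weights, env_tokens)) for d in range(dim)]
        attended.append(mixed)
    return attended


def dual_condition_block(latent_tokens, env_tokens):
    """Residual update: add the attended environment context to each latent token."""
    attended = cross_attention(latent_tokens, env_tokens)
    return [[x + a for x, a in zip(tok, att)] for tok, att in zip(latent_tokens, attended)]


if __name__ == "__main__":
    latents = [[1.0, 0.0], [0.0, 1.0]]   # two toy speech latent tokens
    environment = [[0.5, 0.5]]           # one toy environment token
    print(dual_condition_block(latents, environment))
```

With a single environment token, every latent token attends to it with weight 1, so the block simply adds that token to each latent. With several environment tokens, each latent receives a different convex mixture of them, which is how the condition can influence each position differently.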
