VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung

TL;DR

VoiceDiT tackles environment-aware speech synthesis: generating speech that matches a text prompt while reflecting the acoustic environment, including noisy or reverberant spaces. It introduces a three-component pipeline: pre-training on synthetic data, a Dual-Condition Diffusion Transformer that uses cross-attention for environment conditioning, and an image-to-audio translator that maps visual prompts to audio embeddings. The model uses a TTS module for text alignment and a Latent Mapper to keep computation practical. It is optimized with the diffusion loss $L_{\text{diff}}$ and related TTS losses, then fine-tuned on real-world AudioSet-speech. Experiments show that VoiceDiT outperforms prior methods such as VoiceLDM on real data, with better MOS, intelligibility, and cross-modal alignment, and that it can generate audio from both text and images. This work enables more realistic, context-aware audio for media workflows.

Abstract

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.

Paper Structure (14 sections, 2 equations, 1 figure, 3 tables)

This paper contains 14 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Model architecture of VoiceDiT. VoiceDiT consists of a TTS module and a Dual-DiT model. A cross-attention module is integrated into each DiT block to inject environmental conditions. "D.P" stands for Duration Predictor.
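
The caption above says that a cross-attention module inside each DiT block injects the environmental condition. A minimal NumPy sketch of that idea follows. It is not the authors' implementation: the single attention head, the ReLU MLP, the omission of timestep conditioning, and every name and dimension (`dit_block_with_cross_attention`, `init_params`, 64-dimensional latents, 32-dimensional environment embeddings) are assumptions made for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_in, kv_in, w_q, w_k, w_v, w_o):
    # Single-head scaled dot-product attention.
    # Queries come from q_in; keys and values come from kv_in.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return (weights @ v) @ w_o, weights


def dit_block_with_cross_attention(latent, env, params):
    # 1) Self-attention over the speech latent tokens.
    h = layer_norm(latent)
    sa, _ = attention(h, h, *params["self"])
    latent = latent + sa
    # 2) Cross-attention: latent tokens query the environment embedding.
    h = layer_norm(latent)
    ca, cross_weights = attention(h, env, *params["cross"])
    latent = latent + ca
    # 3) Position-wise feed-forward network (ReLU is an assumption).
    w1, w2 = params["mlp"]
    latent = latent + np.maximum(layer_norm(latent) @ w1, 0.0) @ w2
    return latent, cross_weights


def init_params(dim, env_dim, rng):
    # Hypothetical small random initialization, for illustration only.
    def mat(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        "self": (mat(dim, dim), mat(dim, dim), mat(dim, dim), mat(dim, dim)),
        "cross": (mat(dim, dim), mat(env_dim, dim), mat(env_dim, dim), mat(dim, dim)),
        "mlp": (mat(dim, 4 * dim), mat(4 * dim, dim)),
    }


rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 64))  # 50 latent frames, 64 features each
env = rng.normal(size=(4, 32))      # 4 environment-condition tokens, 32 features each
params = init_params(64, 32, rng)
out, cross_weights = dit_block_with_cross_attention(latent, env, params)
```

Here `out` keeps the shape of the latent sequence, and `cross_weights` holds one attention distribution over the environment tokens per latent frame. Because the environment enters only through the keys and values of the cross-attention step, the condition can have a different length and feature size from the speech latents.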