DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling

TL;DR

DAIEN-TTS tackles environment-aware zero-shot TTS by disentangling speaker and background environment information through a speech-environment separation (SES) module and a disentangled audio infilling framework built on flow matching. It introduces cross-attention conditioning within the diffusion Transformer, dual classifier-free guidance (DCFG) to steer the speech and environment components separately, and an SNR adaptation strategy to align the synthesized speech with the environment prompt. Experiments report high naturalness, strong speaker similarity, and faithful environment reconstruction under time-varying backgrounds, with the approach outperforming baselines in both objective and subjective evaluations. The result is realistic, controllable TTS suited to applications such as audiobooks and VR, where dynamic environmental context is essential.
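
The summary above does not state the exact DCFG formula, so the following is only a minimal sketch of one common way to stack two classifier-free guidance terms. It assumes three velocity predictions from the flow-matching model (unconditional, speech-conditioned, and speech-plus-environment-conditioned) and two guidance weights. The function name `dual_cfg`, its parameter names, and the particular combination rule are illustrative assumptions, not the authors' formulation.

```python
from typing import List, Sequence


def dual_cfg(
    v_uncond: Sequence[float],
    v_speech: Sequence[float],
    v_full: Sequence[float],
    w_speech: float,
    w_env: float,
) -> List[float]:
    """Combine three velocity predictions with two guidance weights.

    ASSUMPTION: this stacking rule is a generic multi-condition guidance
    form, not a formula taken from the paper.
      v_uncond: prediction with all conditions dropped
      v_speech: prediction with text and speech prompt, environment dropped
      v_full:   prediction with text, speech prompt, and environment prompt
    """
    if not (len(v_uncond) == len(v_speech) == len(v_full)):
        raise ValueError("all predictions must have the same length")
    return [
        # Start from the unconditional prediction, push toward the speech
        # condition, then push further toward the environment condition.
        u + w_speech * (s - u) + w_env * (f - s)
        for u, s, f in zip(v_uncond, v_speech, v_full)
    ]


if __name__ == "__main__":
    # Toy one-dimensional predictions, purely for illustration.
    print(dual_cfg([0.0], [1.0], [3.0], w_speech=2.0, w_env=0.5))
```

With both weights set to 1 the combination reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, which is the usual sanity check for a guidance rule of this shape.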

Abstract

This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
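
The abstract does not spell out how the SNR adaptation works. As a rough illustration of the underlying arithmetic only, the sketch below measures the SNR between a speech signal and an environment signal in decibels and computes the amplitude gain that would bring the speech to a target SNR. The helper names (`mean_power`, `snr_db`, `gain_for_target_snr`) and the choice to rescale the speech rather than the environment are assumptions made for this example.

```python
import math
from typing import Sequence


def mean_power(x: Sequence[float]) -> float:
    """Average power (mean of squared samples) of a signal."""
    if len(x) == 0:
        raise ValueError("signal must not be empty")
    return sum(s * s for s in x) / len(x)


def snr_db(speech: Sequence[float], env: Sequence[float]) -> float:
    """SNR in dB, treating the environment audio as the 'noise' term."""
    p_speech, p_env = mean_power(speech), mean_power(env)
    if p_speech == 0.0 or p_env == 0.0:
        raise ValueError("both signals must have non-zero power")
    return 10.0 * math.log10(p_speech / p_env)


def gain_for_target_snr(
    speech: Sequence[float], env: Sequence[float], target_db: float
) -> float:
    """Amplitude gain for the speech so that its SNR against env equals target_db.

    ASSUMPTION: scaling the speech (not the environment) is an arbitrary
    choice for this sketch. Amplitude gain uses a factor of 20 because
    power scales with the square of amplitude.
    """
    return 10.0 ** ((target_db - snr_db(speech, env)) / 20.0)


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]
    env = [0.1, -0.1, 0.1, -0.1]
    g = gain_for_target_snr(speech, env, target_db=0.0)
    scaled = [g * s for s in speech]
    print(snr_db(speech, env), g, snr_db(scaled, env))
```

In a setting like the one described, the target value would presumably be estimated from the environment prompt, but that estimation step is outside what this summary specifies.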

Paper Structure

This paper contains 14 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the proposed DAIEN-TTS training (left) and inference (right) processes.
  • Figure 2: Model structure of the SES module.