DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

TL;DR

DiTAR autoregressively generates continuous speech representations without discrete tokens, using a patch-based framework that couples a causal language model (LM) with a bidirectional diffusion transformer, LocDiT. The LM models inter-patch dependencies, while LocDiT generates the tokens within each patch; the whole model is trained end to end with a diffusion loss and uses LM guidance. A temperature-based sampling strategy for the reverse diffusion process balances diversity and determinism, and scaling analyses show consistent gains from more training data or larger models. DiTAR reaches state-of-the-art zero-shot TTS performance in robustness, speaker similarity, and naturalness at substantially lower compute, and ablations examine the effect of patch size, historical context, and guidance.
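The division of labor between the LM and LocDiT can be illustrated with a toy generation loop. This is a minimal sketch under loud assumptions, not the authors' implementation: the sizes `PATCH`, `DIM`, and `HID`, the random weight matrices, and the functions `aggregate`, `causal_lm`, `local_decoder`, and `generate` are all hypothetical stand-ins (a running mean replaces the causal transformer, and a few Euler steps toward a fixed target replace the diffusion decoder). Only the data flow follows the summary: patches are aggregated into embeddings, the LM consumes them causally, and a local decoder produces the next patch from the LM output.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, DIM, HID = 4, 8, 16  # hypothetical sizes: tokens per patch, token dim, hidden dim

# Random weights standing in for trained networks (assumption, for illustration only).
W_agg = rng.normal(size=(PATCH * DIM, HID)) * 0.1
W_lm = rng.normal(size=(HID, HID)) * 0.1
W_dec = rng.normal(size=(HID, PATCH * DIM)) * 0.1


def aggregate(patch):
    """Collapse one patch of continuous tokens (PATCH, DIM) into one embedding (HID,)."""
    return patch.reshape(-1) @ W_agg


def causal_lm(embeddings):
    """Toy causal model: position i only sees positions <= i (running mean)."""
    e = np.stack(embeddings)
    causal_mean = np.cumsum(e, axis=0) / np.arange(1, len(e) + 1)[:, None]
    return np.tanh(causal_mean @ W_lm)


def local_decoder(cond, steps=8):
    """Toy stand-in for LocDiT: move a noise patch toward a target implied by `cond`."""
    x = rng.normal(size=(PATCH, DIM))
    target = (cond @ W_dec).reshape(PATCH, DIM)
    for i in range(steps):
        t = i / steps
        velocity = (target - x) / (1.0 - t)  # straight-line flow toward the target
        x = x + velocity / steps
    return x


def generate(prompt_patches, n_new):
    """Autoregress over patches: aggregate history, run the LM, decode the next patch."""
    patches = list(prompt_patches)
    for _ in range(n_new):
        h = causal_lm([aggregate(p) for p in patches])
        patches.append(local_decoder(h[-1]))  # condition on the last LM output
    return np.concatenate(patches, axis=0)


prompt = [rng.normal(size=(PATCH, DIM)) for _ in range(2)]
out = generate(prompt, n_new=3)
print(out.shape)  # (20, 8): (2 prompt + 3 generated) patches of 4 tokens each
```

The point of the sketch is the cost structure: the sequence seen by the LM is shorter than the token sequence by a factor of the patch size, while the decoder only ever works on one patch at a time.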

Abstract

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
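One way to read "temperature as the time point of introducing noise" is sketched below on a one-dimensional toy problem. Everything here is an assumption made for illustration and not the paper's algorithm: the time convention (t = 1 is pure noise, t = 0 is data), the linear path x_t = (1 - t) x_0 + t ε, the zero-initialized deterministic start, the noise scale at injection, the toy data distribution N(`MU`, `S`²), and the closed-form `velocity` that stands in for a trained network. The solver runs deterministically from t = 1 and injects noise only once the time reaches the temperature `tau`, so `tau = 0` never injects noise and `tau = 1` recovers ordinary sampling from noise.

```python
import numpy as np

MU, S = 2.0, 1.0  # hypothetical toy data distribution N(MU, S^2)


def velocity(x, t):
    """Exact marginal velocity dx/dt for x_t = (1 - t) * x0 + t * eps,
    with x0 ~ N(MU, S^2) and eps ~ N(0, 1). Stands in for a trained model."""
    var_t = (1 - t) ** 2 * S**2 + t**2
    gain = (t - (1 - t) * S**2) / var_t
    return -MU + gain * (x - (1 - t) * MU)


def sample(tau, n_samples, steps=100, seed=0):
    """Euler-integrate the ODE from t = 1 down to t = 0, injecting noise at t ~= tau."""
    rng = np.random.default_rng(seed)
    inject_at = round((1 - tau) * steps)  # step index whose time is closest to tau
    x = np.zeros(n_samples)               # deterministic start (assumption)
    for k in range(steps):
        t = 1 - k / steps
        if k == inject_at:
            # Noise enters here, scaled by t to match the noise share of x_t.
            x = x + t * rng.normal(size=n_samples)
        x = x - velocity(x, t) / steps    # one Euler step backward in time
    return x


for tau in (0.0, 0.5, 1.0):
    xs = sample(tau, n_samples=20000)
    print(f"tau={tau:.1f}  mean={xs.mean():.3f}  std={xs.std():.3f}")
```

In this toy setting the spread of the samples grows with `tau` while the mean stays near `MU`, which mirrors the stated goal of trading diversity against determinism with a single knob.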

Paper Structure

This paper contains 32 sections, 12 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: DiTAR is composed of an aggregation encoder for input, a causal language model backbone, and a diffusion decoder, LocDiT, predicting local patches of tokens.
  • Figure 2: The performance of DiTAR consistently improves with increases in either training data or model size. The star marker indicates performance that surpasses human levels.
  • Figure 3: The impact of the patch size of LocDiT.
  • Figure 4: The impact of LM guidance under different NFE setups. $w=0$ indicates that guidance is not used.
  • Figure 5: The impact of temperature on generation diversity.
  • ...and 4 more figures