Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang

TL;DR

This work addresses the expressivity-trainability gap in diffusion language models by introducing Coevolutionary Continuous Discrete Diffusion (CCDD), a joint diffusion framework on both a continuous latent space and a discrete token space. The method defines a forward process that factorizes into a CTMC on tokens and an SDE on continuous representations, while a single time-conditioned denoiser learns to reverse both modalities with modality-specific heads. Theoretical results show that continuous diffusion strictly dominates discrete diffusion and can simulate looped transformers, and practical techniques (contexts from pretrained embeddings, CFG, asynchronous noise) enable training and sampling that leverage latent reasoning without sacrificing decodability. Empirical results on LM1B and OpenWebText demonstrate substantial perplexity reductions over strong baselines, with further gains from using contextualized embeddings and guidance strategies, highlighting the practical impact of combining latent reasoning with explicit token generation. Overall, CCDD offers a scalable path to latent reasoning in diffusion-based language modeling, balancing expressivity and trainability for real-world tasks.
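The factorized forward process described above can be illustrated with a small sketch. This is not the paper's implementation: the absorbing-mask corruption for tokens, the variance-preserving Gaussian schedule for latents, the `MASK_ID` constant, and the `forward_noise` helper are all assumptions chosen for illustration. The sketch only shows that, given the clean pair, the token side and the latent side are corrupted with independent noise at a shared time `t`.

```python
import math
import random

MASK_ID = -1  # hypothetical id for an absorbing [MASK] state (assumption)


def forward_noise(tokens, latents, t, rng):
    """Sample (x_t, z_t) from clean (x_0, z_0) at time t in [0, 1].

    Assumed marginals (illustrative, not taken from the paper):
      tokens:  each token is independently replaced by MASK_ID with probability t
      latents: z_t = sqrt(1 - t) * z_0 + sqrt(t) * eps, with eps ~ N(0, 1)
    The two draws are independent, so the joint kernel factorizes.
    """
    # Discrete side: absorbing-state corruption, one Bernoulli(t) draw per token.
    noisy_tokens = [MASK_ID if rng.random() < t else tok for tok in tokens]

    # Continuous side: Gaussian corruption with signal/noise weights a and b.
    a, b = math.sqrt(1.0 - t), math.sqrt(t)
    noisy_latents = [
        [a * value + b * rng.gauss(0.0, 1.0) for value in row]
        for row in latents
    ]
    return noisy_tokens, noisy_latents


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [5, 17, 42, 8]                                    # toy token ids
    z0 = [[0.5, -1.0], [2.0, 0.0], [1.5, 3.0], [-0.25, 0.75]]  # toy latents
    xt, zt = forward_noise(x0, z0, 0.5, rng)
    print(xt)
    print(zt)
```

A denoiser in this setting would take both `xt` and `zt` together with `t` as input and predict the clean tokens and clean latents through separate output heads; that network is omitted here.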

Abstract

Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thought, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, it introduces the additional difficulty of decoding tokens from the continuous representation space into the discrete token space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive, with rich semantics in the latent space, while retaining good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which show strong empirical performance in extensive language modeling experiments on real-world tasks.

Paper Structure

This paper contains 64 sections, 17 theorems, 35 equations, 4 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

At any fixed $t\in[0,1]$, we have the following strict inclusion

Figures (4)

  • Figure 1: Comparison of theoretical expressiveness and practical trainability of: discrete diffusion (left), continuous diffusion with optional continuous noise (middle), and looped transformer (right).
  • Figure 2: Comparison of validation losses when using representations from different layers of Qwen3-Embedding-0.6B as the latent spaces for CDMs.
  • Figure 3: Framework of Coevolutionary Continuous Discrete Diffusion.
  • Figure 4: Comparison of different denoising network architectures for CCDD.

Theorems & Definitions (39)

  • Definition 1: Trajectory families and embedded discrete family
  • Theorem 1: Strict trajectory-level gap
  • Remark 1: “Finite-combination” viewpoint
  • Proposition 2: Continuous diffusion sampler can simulate looped rollouts
  • Remark 2: Conditioning vs. factorization
  • Lemma 3: Embedded discrete trajectories are finitely supported at each $t$
  • proof
  • Lemma 4: Continuous diffusion produces absolutely continuous marginals
  • proof
  • Theorem 5: Strict trajectory-level gap (Theorem 1 in the main text)
  • ...and 29 more