Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola

TL;DR

HDLM introduces a hierarchical, discrete diffusion framework for language modeling that performs time-varying next-semantic-scale prediction: the forward process lifts fine-grained word tokens to coarser semantic hierarchies, and the reverse process refines them back. Grounded in a continuous-time Markov chain (CTMC), it derives a closed-form CT-ELBO that decomposes into hierarchical cross-entropy terms, and recovers MDLM as a special case. Empirically, HDLM achieves state-of-the-art or competitive perplexities on OpenWebText with smaller models, and its training techniques (e.g., cluster perturbations, force-transition decoding) improve robustness and self-correction. The work provides a flexible design space for hierarchical diffusion in NLP and points to scalable, non-autoregressive language modeling with self-refinement capabilities.

Abstract

In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
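
The forward process described in the abstract can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the four-word vocabulary, the `WORD_TO_CLUSTER` map, the function names, and in particular the way the per-level probabilities are parametrized by two polynomial schedules. The sketch only shows the general idea that each token is independently lifted to a coarser ancestor (word, then cluster, then mask) with probabilities that depend on the time $t$.

```python
import random

MASK = "[MASK]"

# Hypothetical toy hierarchy (not from the paper): each word maps to exactly
# one cluster, and several words may share a cluster (a surjective map).
WORD_TO_CLUSTER = {
    "cat": "<ANIMAL>", "dog": "<ANIMAL>",
    "runs": "<MOTION>", "walks": "<MOTION>",
}


def schedule(t, gamma):
    """Polynomial schedule (1 - t)^gamma for t in [0, 1]."""
    return (1.0 - t) ** gamma


def forward_perturb(tokens, t, rng, gamma_word=2.0, gamma_cluster=1.0):
    """Independently place each token at the word, cluster, or mask level.

    Assumed marginals (illustrative only, not the paper's parametrization):
      P(word)    = a
      P(cluster) = b - a
      P(mask)    = 1 - b
    with a = (1 - t)^gamma_word <= b = (1 - t)^gamma_cluster.
    """
    if gamma_word < gamma_cluster:
        raise ValueError("need gamma_word >= gamma_cluster so that a <= b")
    a = schedule(t, gamma_word)
    b = schedule(t, gamma_cluster)
    noisy = []
    for tok in tokens:
        u = rng.random()  # one uniform draw per token, independent across tokens
        if u < a:
            noisy.append(tok)                   # still a word token
        elif u < b:
            noisy.append(WORD_TO_CLUSTER[tok])  # lifted to its cluster ancestor
        else:
            noisy.append(MASK)                  # lifted to the top-level mask
    return noisy


rng = random.Random(0)
sentence = ["cat", "runs", "dog", "walks"]
for t in (0.0, 0.5, 1.0):
    print(t, forward_perturb(sentence, t, rng))
```

Under these assumptions the sentence is unchanged at $t=0$ and fully masked at $t=1$. Setting `gamma_word == gamma_cluster` drives the cluster probability to zero, so the sketch degenerates to plain masking; this mirrors, in spirit only, the abstract's statement that MDLM is a special case.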

Paper Structure

This paper contains 40 sections, 8 theorems, 61 equations, 3 figures, 9 tables, 2 algorithms.

Key Result

Proposition 1

The time-inhomogeneous generator matrix of HDLM is where $\Xi=\mathbf 1^{1\times |C|}$. The cumulative conditional transition matrix of HDLM is:

Figures (3)

  • Figure 1: Hierarchical Diffusion Language Model (three hierarchies are shown). (Left) When training HDLM, word tokens transit to higher hierarchies independently with more abstract semantics. In the reverse process, the model learns to predict the lower level tokens. (Right) The vocabularies of three hierarchies (word, cluster, mask) and the illustration of next semantic scale prediction.
  • Figure 2: Marginal probabilities of the example forward processes with $\alpha_t=(1-t)^{\gamma}$ and $\gamma=1,2,3$.
  • Figure 3: Cross entropy weights of the example forward processes with $\alpha_t=(1-t)^{\gamma}$ and $\gamma=1,2,3$.
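
The schedule $\alpha_t=(1-t)^{\gamma}$ named in the captions of Figures 2 and 3 can be evaluated directly. The short sketch below is only an illustration; the function name `alpha` and the chosen time grid are assumptions. In masked diffusion models such as MDLM, $\alpha_t$ is the probability that a token is still unmasked at time $t$. How HDLM splits the remaining probability mass between the cluster and mask levels is not stated in this summary, so it is not reproduced here.

```python
def alpha(t, gamma):
    """Schedule alpha_t = (1 - t)^gamma on t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) ** gamma


# Tabulate the schedule for the three exponents used in the figure captions.
for gamma in (1, 2, 3):
    values = [alpha(t, gamma) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(f"gamma={gamma}:", [round(v, 4) for v in values])
```

At $t=0.5$ the schedule gives 0.5, 0.25, and 0.125 for $\gamma=1,2,3$, so a larger $\gamma$ removes fine-grained information earlier in the forward process.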

Theorems & Definitions (13)

  • Proposition 1: Time-inhomogeneous generator and cumulative conditional transition matrix of HDLM
  • Lemma 2: Proposition H.4 in GIDD
  • Theorem 3: Closed-form CT-ELBO for HDLM with hierarchical CTMC diffusion process and block conditional transition
  • Proposition 4: Invariance of both token-level and cluster-level loss weights
  • Remark 1
  • Theorem 5: Closed-form CT-ELBO for HDLM with hierarchical CTMC diffusion process and block conditional transition (Theorem 3 in the main text)
  • proof
  • Corollary 6
  • proof
  • Theorem 7: Closed-form CT-ELBO for HDLM with stochastic perturbations
  • ...and 3 more