Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov

TL;DR

Smoothie introduces a distance-based diffusion framework for text that respects token discreteness while leveraging semantic relationships between tokens. Each token is represented as a vector of negative Euclidean distances to all vocabulary embeddings; the forward process gradually smooths the semantic information in this vector, and a learned reverse process denoises the latent representation to generate text. Empirical results across four seq2seq tasks show that Smoothie outperforms existing diffusion-based text models and often rivals autoregressive baselines, with ablations confirming the importance of both the semantic structure and the proposed latent space. The approach also offers a principled way to trade off fluency against diversity and can be extended to other categorical domains given an appropriate distance metric.
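The distance-based representation described above can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the four-token vocabulary, the helper names `neg_distances` and `smooth`, and the use of a softmax with scale `sigma**2` as the smoothing step are all hypothetical choices made for illustration. The actual forward process, noise schedule, and scaling are defined in the paper and its repository.

```python
import numpy as np

def neg_distances(x, vocab_emb):
    """Negative Euclidean distance from embedding x to every vocabulary embedding."""
    return -np.linalg.norm(vocab_emb - x, axis=-1)

def smooth(d0, sigma):
    """Hypothetical smoothing step (an assumption, not the paper's exact process):
    softmax over the distance vector scaled by sigma**2."""
    logits = d0 / sigma**2
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy vocabulary of 4 tokens in 2-D: tokens 0 and 1 are close neighbours,
# tokens 2 and 3 are far away from both.
vocab_emb = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [0.0, 4.0]])

token = 0
d0 = neg_distances(vocab_emb[token], vocab_emb)

# Decoding is natural: the token's own entry has distance 0, the maximum.
decoded = int(np.argmax(d0))

p_small = smooth(d0, 0.1)    # nearly one-hot on the original token
p_mid = smooth(d0, 1.0)      # mass leaks to the nearest neighbour first
p_large = smooth(d0, 100.0)  # nearly uniform: token identity is smoothed away
```

In this toy, increasing `sigma` moves probability mass to semantically close tokens before distant ones, which is the qualitative behaviour that distinguishes smoothing from simplex-style noise that ignores token similarity.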

Abstract

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging because text is discrete. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.

Paper Structure

This paper contains 41 sections, 1 theorem, 26 equations, 2 figures, 9 tables, 2 algorithms.

Key Result

Theorem 4.1

Let $g^*(\mathbf{p}_t, t)$ be an optimal prediction for Eq. (eq:D_0_loss). Then $g^*(\mathbf{p}_t, t) = \mathbf{D}_0(f^*(\mathbf{p}_t, t)) + C$, where $C$ is a constant that does not depend on $f^*(\mathbf{p}_t, t)$, and $f^*(\mathbf{p}_t, t)$ is an optimal prediction for Eq. (eq:e_loss).

Figures (2)

  • Figure 1: An illustration of the diffusion process for Gaussian, simplex, and smoothing diffusion methods. The key distinction between simplex and smoothing diffusion is that the latter incorporates semantic relationships between tokens during the noise addition process.
  • Figure 2: Unconditional generation quality for $\delta = 1$ and varying $\tilde{\delta}$.

Theorems & Definitions (2)

  • Theorem 4.1
  • proof