Compressed and Smooth Latent Space for Text Diffusion Modeling

Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov

TL;DR

Cosmos reframes text generation as diffusion in a learned, compressed, and smooth latent space, replacing token-level autoregression to achieve faster inference and better scalability. The method freezes a pretrained text encoder, uses a Perceiver Resampler compressor to produce a fixed-size latent matrix, and trains a diffusion model in this space; generated latents are then decompressed to text via a predictor. Key contributions include 8× latent compression with generation quality comparable to token-level diffusion, robustness-oriented autoencoder training (MSE alignment, perturbations, and latent augmentation), and empirical results showing Cosmos matching or surpassing token-level baselines across multiple tasks with more than 2× faster sampling. The approach performs well on both unconditional and conditional generation and scales to long sequences and to the large OpenWebText corpus, suggesting latent diffusion as a practical alternative for fast, high-quality language generation.
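To make the "fixed-size latent matrix" concrete, here is a minimal sketch of the core idea behind a resampler-style compressor: a fixed set of latent query vectors attends over a variable number of token features, so the output always has the same shape. It is an illustration under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections or stacked layers, the queries are random rather than trained, and the sizes (`d_model = 8`, `num_latents = 16`, inputs of 40 and 128 tokens) are hypothetical values chosen only so that 128 tokens map to 16 latents, an 8× reduction in sequence length.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(latent_queries, token_features):
    """Map [num_tokens][d] token features to a fixed [num_latents][d] matrix.

    Each latent query attends over all tokens (scaled dot-product attention)
    and returns the attention-weighted average of the token features.
    """
    d = len(token_features[0])
    scale = 1.0 / math.sqrt(d)
    compressed = []
    for q in latent_queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in token_features]
        weights = softmax(scores)
        row = [sum(w * v[j] for w, v in zip(weights, token_features)) for j in range(d)]
        compressed.append(row)
    return compressed


rng = random.Random(0)
d_model, num_latents = 8, 16  # hypothetical sizes, for illustration only

# Stand-ins for learned queries and frozen-encoder outputs (random here).
latent_queries = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_latents)]
short_text = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(40)]
long_text = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(128)]

z_short = cross_attention_compress(latent_queries, short_text)
z_long = cross_attention_compress(latent_queries, long_text)

print(len(z_short), len(z_short[0]))  # 16 8
print(len(z_long), len(z_long[0]))    # 16 8
```

Both inputs produce a 16 × 8 matrix regardless of their token count, which is what lets a diffusion model operate on a latent of constant size.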

Abstract

Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks (story generation, question generation, summarization, and detoxification) and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference. Code is released on GitHub: https://github.com/MeshchaninovViacheslav/cosmos
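The two-part autoencoder objective described above can be sketched as a sum of a token-level cross-entropy term and an MSE alignment term. This is a hedged illustration only: it assumes the MSE term compares features reconstructed by the decompressor against the frozen encoder's activations, and the weighting parameter `mse_weight` and all function names are hypothetical, not taken from the paper or its code.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy; logits has shape [seq_len][vocab]."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += log_z - row[target]
    return total / len(target_ids)


def mse(pred, target):
    """Mean squared error over all entries of two [seq_len][d] matrices."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def autoencoder_loss(logits, target_ids, reconstructed_feats, frozen_feats, mse_weight=1.0):
    """Reconstruction (CE) plus alignment (MSE); mse_weight is an assumed knob."""
    return cross_entropy(logits, target_ids) + mse_weight * mse(reconstructed_feats, frozen_feats)


# Toy inputs: 2 tokens, vocabulary of 3, feature dimension 2.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 1]
frozen = [[1.0, 0.0], [0.0, 1.0]]
perfect = [[1.0, 0.0], [0.0, 1.0]]
shifted = [[1.5, 0.0], [0.0, 0.5]]

loss_aligned = autoencoder_loss(logits, targets, perfect, frozen)
loss_misaligned = autoencoder_loss(logits, targets, shifted, frozen)
print(loss_aligned < loss_misaligned)  # True
```

When the reconstructed features match the frozen activations exactly, the MSE term vanishes and only the token reconstruction term remains; any drift from the frozen encoder's features raises the loss.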

Paper Structure

This paper contains 59 sections, 5 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Overview of our training pipeline. A frozen BERT encoder extracts features, which are augmented before compression. A lightweight compressor–decompressor pair is trained with both token reconstruction (CE) and MSE objectives to produce compact and perturbation-resilient latent representations.
  • Figure 2: Token-level reconstruction accuracy on Wikipedia (512 tokens) as a function of $N$.
  • Figure 3: PPL of texts decoded from an interpolation of two latents for Cosmos and CE baseline.
  • Figure 4: Decoder robustness to latent noising with sequential addition of training modifications.
  • Figure 5: Evaluating diffusion model robustness under mid-trajectory noise injection.
  • ...and 1 more figure
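The smoothness and robustness probes named in the captions of Figures 3 and 4 (decoding an interpolation of two latents, and decoding a noised latent) can be sketched as two small operations on latent matrices. The captions do not specify the exact forms used, so this sketch assumes plain linear interpolation and additive isotropic Gaussian noise; the function names and the `alpha` and `sigma` parameters are hypothetical.

```python
import random


def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two latent matrices of equal shape.

    alpha = 0 returns z_a, alpha = 1 returns z_b (assumed form).
    """
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(z_a, z_b)
    ]


def add_gaussian_noise(z, sigma, rng):
    """Perturb every latent entry with N(0, sigma^2) noise (assumed form)."""
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in row] for row in z]


# Two toy 2 x 3 latent matrices standing in for encoded texts.
z_a = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
z_b = [[2.0, 4.0, 6.0], [1.0, 3.0, 5.0]]

z_mid = interpolate(z_a, z_b, 0.5)
print(z_mid)  # [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

rng = random.Random(0)
z_noisy = add_gaussian_noise(z_a, sigma=0.1, rng=rng)
print(len(z_noisy), len(z_noisy[0]))  # 2 3
```

In an evaluation of this kind, each interpolated or noised latent would be passed through the decoder and the resulting text scored, for example by perplexity as in Figure 3; that decoding step is outside the scope of this sketch.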