Continuous Autoregressive Language Models

Chenze Shao, Darren Li, Fandong Meng, Jie Zhou

TL;DR

Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost, and establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models.

Abstract

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
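To make the K-fold reduction in generative steps concrete, the following minimal sketch groups a token sequence into chunks of K tokens and counts the resulting autoregressive steps. The function name `chunk_tokens`, the padding id `pad_id`, and the choice to pad the final chunk are illustrative assumptions, not details taken from the CALM code; the autoencoder that would map each chunk to a continuous vector is omitted.

```python
import math

def chunk_tokens(token_ids, k, pad_id=0):
    """Split a token sequence into consecutive chunks of k tokens.

    The last chunk is padded with pad_id (an assumption made for this sketch).
    """
    n_chunks = math.ceil(len(token_ids) / k)
    padded = list(token_ids) + [pad_id] * (n_chunks * k - len(token_ids))
    return [padded[i * k:(i + 1) * k] for i in range(n_chunks)]

tokens = list(range(1, 11))          # 10 hypothetical token ids
chunks = chunk_tokens(tokens, k=4)   # each chunk would become one continuous vector

# Token-by-token generation needs len(tokens) steps;
# vector-by-vector generation needs len(chunks) steps.
print(len(tokens), "token steps ->", len(chunks), "vector steps")
```

With 10 tokens and K = 4, the sequence shrinks from 10 generative steps to 3, one per chunk.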

Paper Structure

This paper contains 34 sections, 8 theorems, 45 equations, 11 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

For an implicit discrete distribution $P(x)$ with sampler $S$ and a temperature $T \in (0,1)$, the generalized temperature sampling algorithm generates samples distributed as:
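As an illustration of how temperature can be applied when only a sampler is available, the sketch below covers the special case $T = 1/n$ for an integer $n$. It assumes the target is the usual temperature-scaled distribution $P_T(x) \propto P(x)^{1/T}$, which is an assumption of this example. The procedure draws $n$ independent samples and accepts only when all of them agree, so a value $x$ is accepted with probability $P(x)^n$. This is a simple rejection scheme, not the paper's generalized algorithm, and `sample_with_temperature` and `toy_sampler` are hypothetical names.

```python
import random
from collections import Counter

def sample_with_temperature(sampler, n, max_tries=100_000):
    """Draw from P(x)^n / Z using only calls to sampler(), i.e. temperature T = 1/n."""
    for _ in range(max_tries):
        draws = [sampler() for _ in range(n)]
        # Accept only if all n independent draws coincide (probability P(x)^n for value x).
        if all(d == draws[0] for d in draws):
            return draws[0]
    raise RuntimeError("no accepted sample within max_tries")

rng = random.Random(0)

def toy_sampler():
    # Hypothetical implicit distribution: P(a) = 0.8, P(b) = 0.2.
    return rng.choices(["a", "b"], weights=[0.8, 0.2])[0]

counts = Counter(sample_with_temperature(toy_sampler, n=2) for _ in range(5000))
freq_a = counts["a"] / 5000
print(f"empirical P_T(a) = {freq_a:.3f}")
```

For this toy distribution with $T = 1/2$, the sharpened probability of `a` is $0.8^2 / (0.8^2 + 0.2^2) \approx 0.94$, so the empirical frequency should land close to that value. The acceptance rate falls quickly as $n$ grows or as the distribution flattens, which is the main practical cost of this naive scheme.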

Figures (11)

  • Figure 1: Comparison between conventional token-by-token generation and our proposed vector-by-vector framework (CALM). By compressing K tokens into a single vector, we reduce the sequence length K-fold, fundamentally improving computational efficiency.
  • Figure 2: The Architecture of the Continuous Autoregressive Language Model (CALM). Left: The main autoregressive loop where discrete tokens are compressed to condition a Transformer, whose output hidden state $\mathbf{h}$ guides an energy-based head to predict a continuous vector $\mathbf{z}$. The AE decoder then maps $\mathbf{z}$ back to discrete tokens for the next step. Right: A detailed view of the generative head, showing how it refines a noise vector $\boldsymbol{\varepsilon}_{0}$ through a series of residual MLP blocks.
  • Figure 3: Joint distribution of the cross-entropy loss and the BrierLM score across different models and training checkpoints.
  • Figure 4: The effect of chunk size K on the performance-compute trade-off.
  • Figure 5: Training progress of CALM and traditional Transformer models.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Corollary 2.1
  • Theorem 3
  • Theorem 3
  • proof
  • Theorem 3
  • proof
  • Corollary 3.1
  • proof
  • ...and 2 more