Exploring the Hidden Capacity of LLMs for One-Step Text Generation

Gleb Mezentsev, Ivan Oseledets

TL;DR

This work reveals that frozen LLMs can perform multi-token generation in a single forward pass when conditioned on two trainable proto-tokens, one of which can be shared across texts. By formalizing an exact scheme, defining robust metrics, and analyzing the geometry of the solution space with Bézier curves, the authors show that two-token conditioning enables reconstruction of hundreds of tokens, with performance depending strongly on input arrangement and model family. The findings demonstrate locality in proto-token embeddings and suggest that it is feasible to learn an encoder that maps texts into this space, potentially enabling fast, non-autoregressive generation with off-the-shelf LLMs. The results also quantify a substantial throughput advantage over autoregressive decoding, motivating future work on encoder-based applications, chunk-wise generation, and learned compression in LLM systems.

Abstract

A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts (up to thousands of tokens) via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space, suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.

Paper Structure

This paper contains 18 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: One pass, many tokens. Each dot shows the maximum exact reconstruction length in a single non-autoregressive forward pass with frozen weights, conditioned only on two learned embeddings -- evidence of hidden multi-token capabilities.
  • Figure 2: Two "proto-tokens" (trainable embeddings) are fed into a frozen, pretrained LLM and optimized so that the LLM predicts an arbitrary token sequence in a single forward pass. $e_t$ is trained for each text separately, while $m$ can be shared across texts.
  • Figure 3: Maximum language information ($H_{LM}$ for the longest text prefix that is accurately reconstructed) compressed for different models and datasets. In the left plot, a single [mem] token is used in the autoregressive setting, and in the non-autoregressive one, the $m$ proto-token is shared between all texts within each model. In the right plot, two [mem] tokens are used and the $m$ proto-tokens are not shared. Each small point on the plots represents a single text; larger points indicate the average within each (model, dataset) pair.
  • Figure 4: Reconstruction throughput for autoregressive and non-autoregressive setups. For each model-dataset pair, the throughput equals the maximum losslessly compressible length divided by the reconstruction time.
  • Figure 5: We compare proto-token embedding distances for same-context text pairs and different-context text pairs. Token-level distance is measured as cosine distance between TF-IDF embeddings. Semantic distance is measured as cosine distance between semantic text embeddings (see Section \ref{seq:similarity} for details).
  • ...and 2 more figures
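
The quantities named in the captions for Figures 1, 3, and 4 can be illustrated with a small sketch. It assumes that "accurately reconstructed" means an exact token-by-token match on a prefix, which is a simplification; the paper's own criterion may differ. The function names and the token ids below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def exact_prefix_length(predicted: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest prefix on which predicted and target token ids agree.

    Assumption: exact match is used as the reconstruction criterion.
    """
    length = 0
    for p, t in zip(predicted, target):
        if p != t:
            break
        length += 1
    return length


def reconstruction_throughput(num_tokens: int, seconds: float) -> float:
    """Tokens reconstructed per second, following the ratio in the Figure 4 caption."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


# Hypothetical token ids and timing, for illustration only (not data from the paper).
target = [11, 42, 7, 7, 93, 5]
predicted = [11, 42, 7, 8, 93, 5]

n = exact_prefix_length(predicted, target)        # first mismatch is at index 3
tput = reconstruction_throughput(n, seconds=0.5)  # tokens per second
```

Under this ratio, a one-pass setup that reconstructs the same number of tokens in less wall-clock time than step-by-step decoding gets a proportionally higher throughput, which is the comparison Figure 4 reports.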