Exploring the Hidden Capacity of LLMs for One-Step Text Generation
Gleb Mezentsev, Ivan Oseledets
TL;DR
This work shows that frozen LLMs can generate many tokens in a single forward pass when conditioned on two trainable proto-tokens, one of which can be shared across texts. By formalizing an exact scheme, defining robust metrics, and analyzing the geometry of the solution space with Bézier curves, the authors show that two-token conditioning enables reconstruction of hundreds of tokens, with performance depending strongly on how the proto-tokens are arranged in the input and on the model family. The findings demonstrate locality in proto-token embeddings and suggest that an encoder mapping texts into this space could be learned, potentially enabling fast, non-autoregressive generation with off-the-shelf LLMs. The results also quantify a substantial throughput advantage over autoregressive decoding, motivating future work on encoder-based applications, chunk-wise generation, and learned compression in LLM systems.
Abstract
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts (up to thousands of tokens) via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in a single token-parallel forward pass when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also show empirically that, although these representations are not unique for a given text, they form connected and local regions in embedding space, suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
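The connectivity claim can be made concrete with a small sketch: take two embeddings that both reconstruct the same text, trace a Bézier curve between them, and check that reconstruction still succeeds at every sampled point. The code below is only an illustration under stated assumptions, not the authors' implementation. The curve is assumed to be quadratic with a single control point, and `toy_accuracy` is a hypothetical stand-in for running the frozen LLM and measuring token-level reconstruction accuracy.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def quadratic_bezier(p0: Sequence[float], control: Sequence[float],
                     p1: Sequence[float], t: float) -> Vector:
    """Return the point at parameter t in [0, 1] on a quadratic Bezier curve."""
    if not (len(p0) == len(control) == len(p1)):
        raise ValueError("all points must have the same dimension")
    a = (1.0 - t) ** 2       # weight of the start point
    b = 2.0 * t * (1.0 - t)  # weight of the control point
    c = t ** 2               # weight of the end point
    return [a * x0 + b * xc + c * x1 for x0, xc, x1 in zip(p0, control, p1)]


def sample_path(p0: Sequence[float], control: Sequence[float],
                p1: Sequence[float], num_points: int) -> List[Vector]:
    """Sample num_points evenly spaced (in t) points along the curve."""
    if num_points < 2:
        raise ValueError("need at least two points to include both endpoints")
    return [quadratic_bezier(p0, control, p1, i / (num_points - 1))
            for i in range(num_points)]


def path_stays_accurate(path: Sequence[Sequence[float]],
                        accuracy_fn: Callable[[Sequence[float]], float],
                        threshold: float = 1.0) -> bool:
    """True if every sampled point reaches the accuracy threshold."""
    return all(accuracy_fn(point) >= threshold for point in path)


def toy_accuracy(point: Sequence[float]) -> float:
    """Hypothetical stand-in for LLM reconstruction accuracy.

    Pretends that reconstruction is perfect whenever the embedding norm is at
    least 0.8, so the set of "good" embeddings has a hole around the origin.
    """
    norm = sum(x * x for x in point) ** 0.5
    return 1.0 if norm >= 0.8 else 0.0


# Two hypothetical solutions for the same text, on opposite sides of the hole.
solution_a = [-1.0, 0.0]
solution_b = [1.0, 0.0]

# With the control point at the midpoint, the curve is a straight segment.
straight = sample_path(solution_a, [0.0, 0.0], solution_b, num_points=21)
# A control point off the segment bends the path around the hole.
curved = sample_path(solution_a, [0.0, 2.0], solution_b, num_points=21)

straight_ok = path_stays_accurate(straight, toy_accuracy)
curved_ok = path_stays_accurate(curved, toy_accuracy)
```

In this toy setting the straight segment passes through the low-accuracy hole while the curved path avoids it, which is the sense in which two solutions can be connected without being linearly connected. In a real experiment the control point would presumably be optimized rather than chosen by hand, and `accuracy_fn` would wrap a forward pass of the frozen model.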
