
Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà

TL;DR

The paper asks whether decoder-only transformer language models preserve input information by mapping distinct input prompts to distinct hidden representations, despite components that are individually non-injective. It proves mathematically that the prompt-to-hidden-state map is injective at initialization and remains injective under gradient-based training, leveraging real-analytic components, continuous initialization distributions, and preservation of absolute continuity; collisions can occur only for a measure-zero set of parameter settings. It then introduces SipIt, a linear-time algorithm that reconstructs the exact input prompt from hidden activations by token-by-token identification, with a correctness guarantee and a worst-case cost of $O(T|\mathcal{V}|)$ steps. Empirical validation across six state-of-the-art models shows no collisions in billions of tests and exact token-level recovery, highlighting implications for transparency, interpretability, and safe deployment.
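The token-by-token identification idea behind SipIt can be illustrated with a toy stand-in for a causal model. Everything below is a hypothetical sketch: the `hidden_state` function, the 50-token vocabulary, and the matching tolerance are illustrative assumptions, not the paper's implementation. The point is the cost structure: each of the $T$ positions is recovered with at most $|\mathcal{V}|$ forward evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list(range(50))              # toy vocabulary of 50 token ids (assumption)
EMB = rng.standard_normal((50, 8))   # random token embeddings (assumption)

def hidden_state(prefix):
    """Toy deterministic causal map from a prefix to a vector; stands in for
    the last-layer hidden state of a real transformer."""
    h = np.zeros(8)
    for tok in prefix:
        h = np.tanh(0.5 * h + EMB[tok])
    return h

def sipit_recover(target_states, vocab=VOCAB, tol=1e-9):
    """Greedy token-by-token inversion: O(T * |V|) forward evaluations."""
    recovered = []
    for h_t in target_states:
        for v in vocab:
            # injectivity means exactly one candidate reproduces h_t
            if np.linalg.norm(hidden_state(recovered + [v]) - h_t) < tol:
                recovered.append(v)
                break
        else:
            raise ValueError("no token matches: map not injective at this step")
    return recovered

prompt = [3, 17, 42, 8, 8, 25]
states = [hidden_state(prompt[:t + 1]) for t in range(len(prompt))]
assert sipit_recover(states) == prompt
```

Because the toy map is deterministic, the true token reproduces the target state exactly, so an exact-match tolerance suffices; a real implementation would need the numerical-robustness care the paper's guarantees address.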

Abstract

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
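The collision tests described above can be mimicked in miniature: given a matrix of last-token hidden states, check that the minimum pairwise distance stays above the $10^{-6}$ collision threshold used in the paper's figures. The random matrix below is a stand-in for real activations; the function itself is a standard pairwise-distance computation, not the paper's code.

```python
import numpy as np

def min_pairwise_distance(states):
    """Smallest L2 distance between any two distinct rows of `states`."""
    sq = (states ** 2).sum(axis=1)
    # squared-distance matrix via ||x||^2 + ||y||^2 - 2 x.y (O(N^2) memory)
    d2 = sq[:, None] + sq[None, :] - 2.0 * states @ states.T
    np.fill_diagonal(d2, np.inf)          # ignore self-distances
    return float(np.sqrt(max(d2.min(), 0.0)))  # clamp tiny negative round-off

rng = np.random.default_rng(0)
H = rng.standard_normal((2000, 64))       # stand-in for last-token hidden states
COLLISION_THRESHOLD = 1e-6                # threshold from the paper's figures
print(min_pairwise_distance(H) > COLLISION_THRESHOLD)
```

A collision would show up as a minimum distance at or below the threshold; the paper reports that across billions of such comparisons on six models, no pair ever came close.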


Paper Structure

This paper contains 43 sections, 48 theorems, 19 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem D.1

Fix $t$ and the prefix $\pi\in\mathcal{V}^{t-1}$. Under Assumptions ass:causal and ass:inj, the hidden states induced by distinct next tokens are almost surely distinct. Equivalently, $F$ is injective almost surely.
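The measure-zero argument behind this result rests on a standard fact about real-analytic functions. The sketch below uses my own notation for the collision function (the symbols $g_{p,p'}$ and $\theta$ are not the paper's), but the logic mirrors the chain of results summarized in the TL;DR:

```latex
% Key fact: if g : \mathbb{R}^n \to \mathbb{R} is real-analytic and
% g \not\equiv 0, then its zero set has Lebesgue measure zero:
\lambda\bigl(\{\theta \in \mathbb{R}^n : g(\theta) = 0\}\bigr) = 0.
% Applied per pair of distinct prompts p \neq p': the collision function
%   g_{p,p'}(\theta) = \lVert F_\theta(p) - F_\theta(p') \rVert^2
% is real-analytic in the parameters \theta (Theorem 2.1), and exhibiting a
% single parameter setting that separates p from p' shows g_{p,p'} \not\equiv 0.
% Hence collisions for each pair occupy a measure-zero parameter set, and a
% countable union over all prompt pairs is still measure zero. Any continuous
% (absolutely continuous) initialization therefore avoids it almost surely,
% and gradient-based training preserves absolute continuity (Theorem 2.3).
```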

Figures (9)

  • Figure 1: The map from prompts to latent space is injective. SipIt inverts it.
  • Figure 2: Two real-analytic functions $f_1$ and $f_2$ and their difference $f_1-f_2$. Black contours show the zero sets, which form thin curves (measure zero) rather than regions of positive measure.
  • Figure 3: Seeking collisions in a large-scale prompt set. The minimum distances between last-token states are far above the collision threshold $10^{-6}$: (left) across layers for GPT-2 and Gemma-3 families (one dot per layer), (right) across depth for GPT-2 Small, where distances grow with depth.
  • Figure 4: Exhaustive collision search on the $10$ closest prefix prompts. The boxplots look flat and uneventful, and that is the point: even under stress-test conditions with billions of candidate pairs, all minima stay well above the collision threshold, showing that nothing collapses.
  • Figure 5: Sequence length vs. pairwise distance for GPT-2. Min, mean, and max distances rise at short lengths and then stabilize, indicating consistent separability.
  • ...and 4 more figures

Theorems & Definitions (137)

  • Theorem 2.1: Transformers are real-analytic
  • proof: Sketch of proof (full proof in Appendix sec:app:trans, Proposition prop:modules-ra)
  • Theorem 2.2: Almost-sure injectivity at initialization
  • proof: Sketch of proof (full proof in Appendix sec:app:asinj, Theorem thm:a.s.-distinct-h1)
  • Theorem 2.3: Injectivity preserved under training
  • proof: Sketch of proof (full proof in Theorems thm:main and thm:ac-gd)
  • Corollary 2.3.1: SGD and mini-batch GD
  • proof
  • Corollary 2.3.2: Distinctness for finite sets
  • proof
  • ...and 127 more