
Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà

TL;DR

The paper asks whether decoder-only transformer language models preserve input information by mapping distinct input prompts to distinct hidden representations, despite components that are individually non-injective. It proves mathematically that the prompt-to-hidden-state map is injective at initialization and remains injective under gradient-based training, leveraging real-analytic components, continuous initialization distributions, and preservation of absolute continuity; collisions can occur only for a measure-zero set of parameter settings. It then introduces SipIt, a linear-time algorithm that reconstructs the exact input prompt from hidden activations by token-by-token identification, with a correctness guarantee and a worst-case cost of $O(T|\mathcal{V}|)$ steps. Empirical validation across six state-of-the-art models shows no collisions in billions of tests and exact token-level recovery, highlighting implications for transparency, interpretability, and safe deployment.
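The token-by-token identification idea behind SipIt can be illustrated with a toy stand-in for a causal model. Everything below is a hypothetical sketch: the `hidden_state` function, the 50-token vocabulary, and the matching tolerance are illustrative assumptions, not the paper's implementation. The point is the cost structure: each of the $T$ positions is recovered with at most $|\mathcal{V}|$ forward evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list(range(50))              # toy vocabulary of 50 token ids (assumption)
EMB = rng.standard_normal((50, 8))   # random token embeddings (assumption)

def hidden_state(prefix):
    """Toy deterministic causal map from a prefix to a vector; stands in for
    the last-layer hidden state of a real transformer."""
    h = np.zeros(8)
    for tok in prefix:
        h = np.tanh(0.5 * h + EMB[tok])
    return h

def sipit_recover(target_states, vocab=VOCAB, tol=1e-9):
    """Greedy token-by-token inversion: O(T * |V|) forward evaluations."""
    recovered = []
    for h_t in target_states:
        for v in vocab:
            # injectivity means exactly one candidate reproduces h_t
            if np.linalg.norm(hidden_state(recovered + [v]) - h_t) < tol:
                recovered.append(v)
                break
        else:
            raise ValueError("no token matches: map not injective at this step")
    return recovered

prompt = [3, 17, 42, 8, 8, 25]
states = [hidden_state(prompt[:t + 1]) for t in range(len(prompt))]
assert sipit_recover(states) == prompt
```

Because the toy map is deterministic, the true token reproduces the target state exactly, so an exact-match tolerance suffices; a real implementation would need the numerical-robustness care the paper's guarantees address.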

Abstract

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
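The collision tests described above can be mimicked in miniature: given a matrix of last-token hidden states, check that the minimum pairwise distance stays above the $10^{-6}$ collision threshold used in the paper's figures. The random matrix below is a stand-in for real activations; the function itself is a standard pairwise-distance computation, not the paper's code.

```python
import numpy as np

def min_pairwise_distance(states):
    """Smallest L2 distance between any two distinct rows of `states`."""
    sq = (states ** 2).sum(axis=1)
    # squared-distance matrix via ||x||^2 + ||y||^2 - 2 x.y (O(N^2) memory)
    d2 = sq[:, None] + sq[None, :] - 2.0 * states @ states.T
    np.fill_diagonal(d2, np.inf)          # ignore self-distances
    return float(np.sqrt(max(d2.min(), 0.0)))  # clamp tiny negative round-off

rng = np.random.default_rng(0)
H = rng.standard_normal((2000, 64))       # stand-in for last-token hidden states
COLLISION_THRESHOLD = 1e-6                # threshold from the paper's figures
print(min_pairwise_distance(H) > COLLISION_THRESHOLD)
```

A collision would show up as a minimum distance at or below the threshold; the paper reports that across billions of such comparisons on six models, no pair ever came close.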


Paper Structure

This paper contains 43 sections, 48 theorems, 19 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem D.1

Fix $t$ and the prefix $\pi\in\mathcal{V}^{t-1}$. Under Assumptions ass:causal and ass:inj, the hidden states induced by distinct next tokens are almost surely distinct. Equivalently, $F$ is injective almost surely.
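The measure-zero argument behind this result rests on a standard fact about real-analytic functions. The sketch below uses my own notation for the collision function (the symbols $g_{p,p'}$ and $\theta$ are not the paper's), but the logic mirrors the chain of results summarized in the TL;DR:

```latex
% Key fact: if g : \mathbb{R}^n \to \mathbb{R} is real-analytic and
% g \not\equiv 0, then its zero set has Lebesgue measure zero:
\lambda\bigl(\{\theta \in \mathbb{R}^n : g(\theta) = 0\}\bigr) = 0.
% Applied per pair of distinct prompts p \neq p': the collision function
%   g_{p,p'}(\theta) = \lVert F_\theta(p) - F_\theta(p') \rVert^2
% is real-analytic in the parameters \theta (Theorem 2.1), and exhibiting a
% single parameter setting that separates p from p' shows g_{p,p'} \not\equiv 0.
% Hence collisions for each pair occupy a measure-zero parameter set, and a
% countable union over all prompt pairs is still measure zero. Any continuous
% (absolutely continuous) initialization therefore avoids it almost surely,
% and gradient-based training preserves absolute continuity (Theorem 2.3).
```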

Figures (9)

  • Figure 1: The map from prompts to latent space is injective. SipIt inverts it.
  • Figure 2: Two real-analytic functions $f_1$ and $f_2$ and their difference $f_1-f_2$. Black contours show the zero sets, which form thin curves (measure zero) rather than regions of positive measure.
  • Figure 3: Seeking collisions in a large-scale prompt set. The minimum distances between last-token states are far above the collision threshold $10^{-6}$: (left) across layers for GPT-2 and Gemma-3 families (one dot per layer), (right) across depth for GPT-2 Small, where distances grow with depth.
  • Figure 4: Exhaustive collision search on the $10$ closest prefix prompts. The boxplots look flat and uneventful, and that is the point: even under stress-test conditions with billions of candidate pairs, all minima stay well above the collision threshold, showing that nothing collapses.
  • Figure 5: Sequence length vs. pairwise distance for GPT-2. Min, mean, and max distances rise at short lengths and then stabilize, indicating consistent separability.
  • ...and 4 more figures

Theorems & Definitions (137)

  • Theorem 2.1: Transformers are real-analytic
  • proof: Sketch of proof (full proof in Appendix sec:app:trans, Proposition prop:modules-ra)
  • Theorem 2.2: Almost-sure injectivity at initialization
  • proof: Sketch of proof (full proof in Appendix sec:app:asinj, Theorem thm:a.s.-distinct-h1)
  • Theorem 2.3: Injectivity preserved under training
  • proof: Sketch of proof (full proof in Theorems thm:main and thm:ac-gd)
  • Corollary 2.3.1: SGD and mini-batch GD
  • proof
  • Corollary 2.3.2: Distinctness for finite sets
  • proof
  • ...and 127 more