On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction

Ivan Bondarenko; Egor Palkin; Fedor Tikunov

TL;DR

Results indicate that the m-token tends to capture semantic information more strongly than the e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; and relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.

Abstract

Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work, "Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction" (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We perform a series of experiments aimed at disentangling semantic and syntactic content in the two proto-tokens, analyzing stability properties of the e-token, and visualizing attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes for "imposing" semantic structure on the e-token using teacher embeddings: an anchor-based loss and a relational distillation objective. Our results indicate three things. First, the m-token tends to capture semantic information more strongly than the e-token under standard optimization. Second, anchor-based constraints trade off sharply with reconstruction accuracy. Third, relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, which supports the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.

Paper Structure (38 sections, 4 equations, 13 figures)

Introduction
Related Work
Method
Proto-token optimization for reconstruction
Semantic augmentation experiment
Lexical augmentations.
Semantic augmentations.
Imposing semantic structure on the $e$-token
Terminology.
Syntactic experiment
Attention visualization
Experiments and Results
Attention visualization to the $e$-token
Mean attention over heads.
Layer-wise mean attention heatmaps.
...and 23 more sections

Figures (13)

Figure 1: Attention to the $e$-token averaged over heads (examples).
Figure 2: Layer-wise heatmaps of attention to the $e$-token (averaged over heads).
Figure 3: Head-level attention to the $e$-token for selected layers/heads (examples).
Figure 4: Reconstruction accuracy as a function of noise intensity $\alpha$ for different noise types.
Figure 5: t-SNE visualization of optimized $e$-token embeddings for original, lexical, and semantic augmentations.
...and 8 more figures
