Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

Mihir Sahasrabudhe

TL;DR

This work asks whether directional failures in sequence models arise from linguistic statistics or from architectural biases, using a fully synthetic, entropy-controlled benchmark. The forward task A→B is deterministic ($H(B|A)=0$), while the inverse task B→A is probabilistic with entropy $H(A|B)=\ln K$, so optimization efficiency can be measured precisely as Excess Loss above analytically known floors. Across scratch-trained Transformers (GPT-2 Small), pre-trained variants, LoRA, and a non-causal MLP baseline, the study finds a robust directional gap for scratch Transformers that varies with $K$ (e.g., ≈1.16 nats at $K=5$ and ≈0.90 nats at $K=8$), whereas MLPs show substantially smaller gaps. Pre-training shifts optimization without dramatically increasing asymmetry, and LoRA exhibits sharp capacity limits on high-entropy inverses. The results reveal a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, providing a clean instrument for dissecting directional biases and motivating deeper mechanistic investigation into why inversion remains harder for Transformers.

Abstract

Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
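
To make the entropy floors concrete, the following is a minimal sketch of how such a benchmark could be built, not the paper's actual data pipeline. It assumes that each B string has exactly K distinct A strings mapped to it and that pairs are sampled uniformly. The function names (`make_mapping`, `conditional_entropy`, `excess_loss`), the string length, and the alphabet are all hypothetical choices made for illustration. Under these assumptions the empirical conditional entropies come out as $H(B|A)=0$ and $H(A|B)=\ln K$.

```python
import math
import random
from collections import Counter, defaultdict


def make_mapping(num_b, k, length=8, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Return (a, b) pairs where every b has exactly k distinct preimages a.

    Assumption (not from the paper): strings are uniform random draws of a
    fixed length, rejected on collision so all A and all B strings are unique.
    """
    rng = random.Random(seed)

    def fresh(seen):
        while True:
            s = "".join(rng.choice(alphabet) for _ in range(length))
            if s not in seen:
                seen.add(s)
                return s

    seen_a, seen_b = set(), set()
    pairs = []
    for _ in range(num_b):
        b = fresh(seen_b)
        for _ in range(k):
            pairs.append((fresh(seen_a), b))
    return pairs


def conditional_entropy(pairs, given, target):
    """Empirical H(target | given) in nats, with pairs weighted uniformly.

    `given` and `target` are tuple indices: 0 selects A, 1 selects B.
    """
    groups = defaultdict(Counter)
    for pair in pairs:
        groups[pair[given]][pair[target]] += 1
    n = len(pairs)
    h = 0.0
    for counts in groups.values():
        total = sum(counts.values())
        for c in counts.values():
            # -sum over (x, y) of p(x, y) * log p(y | x)
            h -= (c / n) * math.log(c / total)
    return h


def excess_loss(train_loss, entropy_floor):
    """Loss above the analytic floor; both values must share units (nats here)."""
    return train_loss - entropy_floor


pairs = make_mapping(num_b=50, k=5)
forward_floor = conditional_entropy(pairs, given=0, target=1)  # H(B|A)
inverse_floor = conditional_entropy(pairs, given=1, target=0)  # H(A|B)
print(forward_floor, inverse_floor, math.log(5))
```

The directional gap discussed above would then be the inverse excess loss minus the forward excess loss. Comparing a measured loss against these floors requires the loss to be aggregated at the same granularity as the floor (for example, summed over the target sequence rather than averaged per token); how the paper normalizes its loss is not stated in this excerpt.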

Paper Structure

This paper contains 53 sections, 13 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Directional Optimization Friction. Excess loss (train loss minus entropy floor) for forward ($A\!\to\!B$) and inverse ($B\!\to\!A$) tasks across architectures and branching factors. Left: MLP vs. GPT-2 (causal LM, scratch initialization) showing higher excess loss for the inverse direction. Right: Directional asymmetry gap (inverse minus forward) as a function of branching factor $K$. GPT-2 exhibits a substantially larger directional gap than the MLP on the same synthetic data.
  • Figure 2: The Plasticity Tax. Pre-trained weights reduce efficiency on synthetic deterministic mappings, yielding higher forward excess loss relative to random initialization.
  • Figure 3: LoRA capacity wall. Even high-rank LoRA struggles to match dense fine-tuning on inverse tasks, indicating limited expressivity for arbitrary high-entropy mappings.
  • Figure 4: Training dynamics on $B\!\to\!A$ ($K=8$). Dense fine-tuning approaches the theoretical floor; LoRA plateaus early, consistent with a rank-limited bottleneck.
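
The "rank-limited bottleneck" mentioned in the LoRA captions can be illustrated with a toy calculation. This is a sketch of the general LoRA parameterization (a weight update written as a product of two thin matrices), not a reproduction of the paper's setup. The matrix sizes, integer entries, and helper names (`matmul`, `matrix_rank`, `lora_param_count`) are assumptions chosen so the example runs with the standard library alone. It shows that an update of the form B·A with inner dimension r can never exceed rank r, and that it uses far fewer trainable parameters than a dense update.

```python
import random
from fractions import Fraction


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def matrix_rank(m):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            if factor != 0:
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in a rank-`rank` update B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)


# Toy sizes, chosen for illustration only.
d, r = 12, 3
rng = random.Random(0)
b_factor = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]
a_factor = [[rng.randint(-3, 3) for _ in range(d)] for _ in range(r)]
delta = matmul(b_factor, a_factor)  # d x d update, but rank at most r

print("rank of low-rank update:", matrix_rank(delta))
print("trainable params (low-rank vs dense):", lora_param_count(d, d, r), d * d)
```

A dense update of the same shape can reach full rank, which is one plausible reading of why dense fine-tuning can memorize arbitrary high-entropy inverse mappings that a low-rank update cannot fit.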