Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities

Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn

TL;DR

The paper asks whether large-scale pretraining can erase the length-generalization limits of transformer architectures. By studying retrieval and copying tasks through a length-generalization framework, it shows that pretrained LLMs acquire a directional bias that favors retrieving the token to the right of a query (induction) over the token to its left (anti-induction), along with a uniqueness bias. Targeted fine-tuning removes the directional asymmetry only where theory guarantees length generalization. Mechanistic analyses link these biases to the relative strength of induction versus anti-induction circuits, and causal patching confirms their roles. The findings show that pretraining enhances certain transformer capabilities but does not overcome fundamental length-generalization limits, which has practical implications for reliability in real-world tasks and for fine-tuning strategies.

Abstract

Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.
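
To make the induction-versus-anti-induction distinction concrete, here is a minimal sketch of the two lookups on a toy token list, assuming a query that occurs exactly once. The function name `retrieve_neighbor` and the toy tokens are illustrative assumptions, not taken from the paper; the code only shows what the correct answer is in each direction, not how a transformer computes it.

```python
def retrieve_neighbor(context, query, direction):
    """Return the token immediately right or left of a unique query token.

    Hypothetical helper for illustration only: "right" corresponds to the
    induction-style lookup, "left" to the anti-induction-style lookup.
    """
    if context.count(query) != 1:
        raise ValueError("this sketch assumes the query occurs exactly once")
    idx = context.index(query)
    target = idx + 1 if direction == "right" else idx - 1
    if not 0 <= target < len(context):
        raise ValueError("query has no neighbor in that direction")
    return context[target]


context = ["a", "b", "c", "d", "e"]
query = "c"

# Induction: the token to the right of the query.
induction_answer = retrieve_neighbor(context, query, "right")
# Anti-induction: the token to the left of the query.
anti_induction_answer = retrieve_neighbor(context, query, "left")
```

Both lookups are equally simple as ordinary programs; the asymmetry reported in the abstract concerns how well pretrained models perform them in context.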

Paper Structure

This paper contains 75 sections, 9 theorems, 12 equations, 26 figures, 9 tables.

Key Result

Theorem 1

NRFirst is expressible in C-Rasp[pos]. NRLast is not expressible in C-Rasp[pos].

Figures (26)

  • Figure 1: Overview of our task variants (formal definitions in Section \ref{sec:bg}). Retrieval: A ‘#’ marks the separator; the token that follows is the query, which may appear once (unique) or multiple times (non-unique). With non-unique queries we return the token immediately left/right of either the first or the last occurrence, creating 6 sub-tasks in total. Copying: We want the model to copy the context, which consists of either only unique tokens or repeated tokens, in the forward or reverse direction. Tasks in green lie in C-Rasp[pos] and length-generalize; those in red do not (proofs in Section \ref{sec:all_theory}).
  • Figure 2: Illustration of the induction and anti-induction circuits.
  • Figure 3: In-context accuracy for Llama‑3 70B and Qwen2.5‑32B across all our tasks, averaged over three seeds. Across all settings, lengths, model sizes, and task types, we observe a directional bias: retrieving the token to the left of the query token is always more difficult than retrieving the one to the right, all else being equal. Similarly, copying in the forward direction is easier than copying backwards. Detailed prompts and similar performance graphs for other models (including instruction-tuned variants) are in Appendix \ref{appendix:prompting_details}.
  • Figure 4: Failures in accurate copying of Lorem Ipsum paragraphs are associated primarily with ambiguous transition indices.
  • Figure 5: Git Commit History Manipulation also shows the forward vs. backward asymmetry seen in Section \ref{subsec:elicit_prompting}.
  • ...and 21 more figures
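
The task family described in the Figure 1 caption can be sketched as reference-answer functions. This is a hedged illustration under stated assumptions: the function names, the space-separated prompt layout, and the task labels other than NRFirst and NRLast (which appear in Theorem 1) are hypothetical and chosen by analogy, not taken from the paper.

```python
def retrieval_target(context, query, side, occurrence="first"):
    """Token immediately left/right of the first or last occurrence of query."""
    positions = [i for i, tok in enumerate(context) if tok == query]
    if not positions:
        raise ValueError("query does not occur in the context")
    pos = positions[0] if occurrence == "first" else positions[-1]
    neighbor = pos + 1 if side == "right" else pos - 1
    if not 0 <= neighbor < len(context):
        raise ValueError("no neighbor on that side of the chosen occurrence")
    return context[neighbor]


def copy_target(context, reverse=False):
    """Expected output for forward or reverse copying of the context."""
    return list(reversed(context)) if reverse else list(context)


def format_prompt(context, query):
    """Assumed layout: context tokens, the '#' separator, then the query."""
    return " ".join(context) + " # " + query


unique_ctx = ["k", "q", "m", "t"]            # query "q" appears once
repeat_ctx = ["k", "q", "m", "t", "q", "z"]  # query "q" appears twice

# Six retrieval sub-tasks; labels other than NRFirst/NRLast are assumed.
answers = {
    "UR": retrieval_target(unique_ctx, "q", "right"),
    "UL": retrieval_target(unique_ctx, "q", "left"),
    "NRFirst": retrieval_target(repeat_ctx, "q", "right", "first"),
    "NLFirst": retrieval_target(repeat_ctx, "q", "left", "first"),
    "NRLast": retrieval_target(repeat_ctx, "q", "right", "last"),
    "NLLast": retrieval_target(repeat_ctx, "q", "left", "last"),
}

prompt = format_prompt(repeat_ctx, "q")
forward_copy = copy_target(unique_ctx)
reverse_copy = copy_target(unique_ctx, reverse=True)
```

With a unique query the choice of occurrence is irrelevant, which is why two of the six retrieval sub-tasks need only a side, while the non-unique variants need both a side and an occurrence.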

Theorems & Definitions (17)

  • Theorem 1
  • Theorem 2
  • Definition 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Lemma 7
  • ...and 7 more