Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing

Zhongwang Zhang; Pengxiao Lin; Zhiwei Wang; Yaoyu Zhang; Zhi-Qin John Xu

TL;DR

This work asks why transformers sometimes learn compositional reasoning rather than merely memorizing mappings. Using an anchor-function setup with two-anchor compositions, it demonstrates a phase diagram controlled by model depth and the initialization rate $\gamma$ (parameters are drawn with standard deviation $1/d_{\mathrm{in}}^{\gamma}$): small initializations promote inferential, low-complexity solutions that compose single-anchor mappings, while larger initializations push toward symmetric memorization. The authors identify distinct information-flow and vector-representation mechanisms for each phase and show that inferential solutions exhibit low complexity, with structured embeddings and condensed directions in $W^{Q(1)}$, whereas symmetric solutions do not. Validation on synthetic data and broader architectures (e.g., GPT-2) across multiple tasks indicates that initializing transformers with an appropriate $\gamma$ can bias models toward reasoning over memorization, with practical implications for tuning models toward compositional generalization. They further propose using $\gamma$ as a tunable hyper-parameter to balance reasoning and memorization in real-world settings, and discuss limitations and future work, including more diverse datasets and mixture-of-experts approaches.

Abstract

Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) solutions, which capture the underlying compositional primitives, or symmetric (memory-based) solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential (reasoning-based) solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of initialization scale in tuning the reasoning and memorizing ability and we propose the initialization rate $\gamma$ to be a convenient tunable hyper-parameter in common deep learning frameworks, where $1/d_{\mathrm{in}}^{\gamma}$ is the standard deviation of parameters of the layer with $d_{\mathrm{in}}$ input neurons.
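To make the role of the initialization rate concrete, the following is a minimal sketch of drawing a layer's weights from a zero-mean normal distribution with standard deviation $1/d_{\mathrm{in}}^{\gamma}$. The function names `init_std` and `init_weight`, the layer sizes, and the example values of $\gamma$ are assumptions chosen for illustration, not the authors' code; it uses only the Python standard library.

```python
import random
import statistics


def init_std(d_in: int, gamma: float) -> float:
    """Standard deviation 1 / d_in**gamma for a layer with d_in input neurons."""
    return d_in ** (-gamma)


def init_weight(d_in: int, d_out: int, gamma: float, seed: int = 0):
    """Sample a d_in x d_out weight matrix from N(0, init_std(d_in, gamma)**2).

    Hypothetical helper: the names and the list-of-lists layout are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    std = init_std(d_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(d_out)] for _ in range(d_in)]


# For d_in > 1, a larger gamma gives a smaller initialization scale.
for gamma in (0.3, 0.5, 0.8):  # example values only
    w = init_weight(100, 100, gamma)
    flat = [x for row in w for x in row]
    print(gamma, init_std(100, gamma), statistics.pstdev(flat))
```

Under this parameterization, $\gamma = 0.5$ corresponds to the familiar $1/\sqrt{d_{\mathrm{in}}}$ scale; because larger $\gamma$ shrinks the initialization, the small-initialization regime described above corresponds to larger $\gamma$.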

Paper Structure (31 sections, 9 equations, 17 figures)

Introduction
Related Work
Definitions
Two-anchor composite function
Data Generation
Mapping Type of an anchor pair
Generalization
Model Architecture and Basic Experimental Setups
Two Phases of Solutions for Composite Functions
Mechanisms of Models in Two Phases
Information Transmission and Fusion Mechanisms
Divergence in Fused Vector Representations across Two Phases
Model Complexity: A Key Factor in Phase Transitions
Further Verification on Realistic Tasks
Discussion
...and 16 more sections

Figures (17)

Figure 1: Experimental setup and possible solutions and mechanisms for the unseen anchor pair (4, 3). (a) Data generation: Left: The single anchors (i.e., 1, 2, 3, 4) correspond to specific arithmetic operations. Middle: During training, 14 out of the 16 possible anchor pairs are assigned inferential mappings, one pair (3, 4) is assigned a non-inferential mapping, and the remaining pair (4, 3) is held out as an unseen task (does not appear in the training). Right: The input sequences comprise an anchor pair, a key item preceding the anchor pair, and noise items unrelated to the target. The question mark indicates the output for the unseen anchor pair (4, 3), which depends on the learned solution. (b) Two potential mechanisms for the unseen anchor pair (4, 3): learning the symmetric structure (Mechanism 1) or composing the inferred single anchor mappings (Mechanism 2).
Figure 2: (a, b) Phase diagram of generalization performance on the unseen anchor pair (4, 3). (a) The model's test accuracy based on the symmetric mapping. (b) The model's test accuracy based on the inferential mapping. The abscissa represents the initialization rate $\gamma$, which corresponds to the standard deviation $(1/d_{\mathrm{in}})^{\gamma}$ of a normal distribution with a mean of 0 used for parameter initialization. The ordinate represents the depth of the transformer model. The shaded zones indicate where the test accuracy on seen anchors is less than 90%. (c) Comparison of accuracy on the unseen anchor pair (4, 3) for both the inferential and symmetric solutions across different initialization rates $\gamma$ on GPT-2. The error bars represent the standard deviation across 4 runs.
Figure 3: (a, c) Information flow in the two-layer networks of symmetric and inferential solutions. The input sequence shown in the figure represents the test sample, with key items and anchor positions annotated. For each layer's attention matrix, we illustrate the mechanisms of information transmission and fusion through the information flow. The thickness of the line represents the corresponding value in the attention matrix $\mathrm{Attn}^{(l)}$. We use different colors to mark the key item and the two single anchors, and highlight the attention connections that significantly contribute to the final output. The final output sequence represents the model's output. (a) Symmetric solution. (c) Inferential solution. (b) t-SNE visualization of vectors $X^{\mathrm{ao}(1)}$ of 10,000 input sequences with different anchor-key item pairs. Symmetric anchor pairs have similar colors in different shades.
Figure 4: Cosine similarity heatmaps for vector representations in different solutions. Each axis represents a selected anchor pair (labeled on the axis), with the value on the coordinate axis representing the value of the key item. The color indicates the cosine similarity between specific vectors defined in each subplot. Red boxes highlight positions where the target outputs obtained by the anchors on the abscissa and ordinate are the same for the corresponding key items. (a, b) Cosine similarity between the output vectors of the second attention layer's last token (the last token of $X^{\mathrm{ao}(2)}$) for different anchor-key item pairs in (a) inferential and (b) symmetric solutions. (c) Cosine similarity between the rows of the second layer Value matrix ($V^{(2)}$) corresponding to the first anchor's position across different anchor-key item pairs for inferential solutions.
Figure 5: (a, b) Cosine similarity of neurons in the $W^{Q(1)}$ matrix. The abscissa and ordinate both represent neuron index. (a) Inferential solution with small initialization. (b) Symmetric solution with large initialization. (c, d) Visualization of the embedding space using t-SNE for different initialization scales. (c) Inferential solution with small initialization. The embedded tokens seem to form arithmetic sequences with common differences of 3 (red arrow) and 4 (blue arrow) along the two directions. (d) Symmetric solution with large initialization. Please refer to Appendix \ref{app:eig_analy} for more detailed experimental results under different model depths and initialization rates $\gamma$.
...and 12 more figures
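
The data-generation setup described in the Figure 1 caption can be sketched roughly as follows. The caption only states that each single anchor corresponds to an arithmetic operation, so the concrete offsets in `ANCHOR_OPS`, the non-inferential offset for pair (3, 4), the item token range, and the sequence length are all hypothetical values chosen for illustration; the helper names are likewise assumptions rather than the authors' implementation.

```python
import random

# Hypothetical additive operations for the four single anchors.
ANCHOR_OPS = {1: 5, 2: 1, 3: -2, 4: -8}
NON_INFERENTIAL_OFFSET = -6   # hypothetical mapping assigned to pair (3, 4)
NON_INFERENTIAL_PAIR = (3, 4)
UNSEEN_PAIR = (4, 3)          # held out of training, as in the caption
ITEM_RANGE = (20, 100)        # hypothetical token ids for key and noise items
SEQ_LEN = 9                   # hypothetical sequence length


def inferential_target(key, pair):
    """Compose the two single-anchor operations on the key item."""
    return key + ANCHOR_OPS[pair[0]] + ANCHOR_OPS[pair[1]]


def training_target(key, pair):
    """Target used in training: inferential except for the pair (3, 4)."""
    if pair == NON_INFERENTIAL_PAIR:
        return key + NON_INFERENTIAL_OFFSET
    return inferential_target(key, pair)


def symmetric_target(key, pair):
    """Answer of a symmetric solution: (4, 3) is treated like (3, 4)."""
    if pair == UNSEEN_PAIR:
        return training_target(key, (pair[1], pair[0]))
    return training_target(key, pair)


def make_sequence(pair, rng):
    """Noise items with a key item placed directly before the anchor pair."""
    seq = [rng.randint(*ITEM_RANGE) for _ in range(SEQ_LEN)]
    pos = rng.randrange(0, SEQ_LEN - 2)  # key position; anchors follow it
    key = seq[pos]
    seq[pos + 1], seq[pos + 2] = pair
    return seq, key


def make_training_set(n, seed=0):
    """Sample n (sequence, target) examples from the 15 seen anchor pairs."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in ANCHOR_OPS for b in ANCHOR_OPS
             if (a, b) != UNSEEN_PAIR]
    data = []
    for _ in range(n):
        pair = rng.choice(pairs)
        seq, key = make_sequence(pair, rng)
        data.append((seq, training_target(key, pair)))
    return data
```

With these illustrative values, a key item of 50 followed by the unseen pair (4, 3) yields 40 under the inferential mapping and 44 under the symmetric mapping, which is the kind of disagreement the two test accuracies in Figure 2 are meant to distinguish.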
