Transferring Linear Features Across Language Models With Model Stitching

Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick

TL;DR

This work investigates transferring linear features across language models via affine model stitching of residual streams. By learning bidirectional affine mappings $\mathcal{T}_{\uparrow}$ and $\mathcal{T}_{\downarrow}$, the authors transfer Sparse Autoencoders (SAEs), probes, and steering vectors between models of different sizes within the same family, achieving notable compute savings and preserving downstream performance under a weak universality assumption. A key finding is that initializing SAE training on a larger model from an SAE transferred from a smaller model can substantially accelerate training (roughly 30–50% fewer FLOPs to reach target explained variance), while probing and steering transfers are effective in several but not all cases, with semantic and structural features showing distinct transfer behavior. The work also analyzes functional feature transfer (e.g., entropy and attention-deactivation features), offering evidence that certain universal features retain their roles after transfer. Limitations include the focus on within-family transfers with the same tokenizer and incomplete scaling laws; future work points to cross-family stitching, robustness, and synergy with LoRA-like methods to broaden applicability and reliability.
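As a rough illustration of what a pair of affine stitches looks like, the sketch below fits two affine maps between synthetic "residual stream" activations by ordinary least squares. This is a swapped-in technique chosen for brevity: the paper trains $\mathcal{T}_{\uparrow}$ and $\mathcal{T}_{\downarrow}$ concurrently, and its exact objective is not reproduced here. The function names, the synthetic data, and the convention that $\mathcal{T}_{\uparrow}$ maps small to large (and $\mathcal{T}_{\downarrow}$ large to small) are assumptions made for the example.

```python
import numpy as np

def fit_affine(X, Y):
    """Least-squares fit of Y ~= X @ W + b for row-vector activations."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    sol, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return sol[:-1], sol[-1]  # W: (d_in, d_out), b: (d_out,)

def explained_variance(Y, Y_hat):
    """1 - residual sum of squares / total sum of squares, pooled over dims."""
    resid = np.sum((Y - Y_hat) ** 2)
    total = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - resid / total

# Synthetic stand-ins for activations of two models on the same tokens
# (hypothetical data; real activations would come from the models).
rng = np.random.default_rng(0)
n, d_small, d_large = 2000, 8, 16
acts_large = rng.normal(size=(n, d_large))
acts_small = acts_large @ rng.normal(size=(d_large, d_small)) + 0.5
acts_small += 0.01 * rng.normal(size=(n, d_small))

# Assumed convention: T_down maps large -> small, T_up maps small -> large.
W_down, b_down = fit_affine(acts_large, acts_small)
W_up, b_up = fit_affine(acts_small, acts_large)

ev_down = explained_variance(acts_small, acts_large @ W_down + b_down)
ev_up = explained_variance(acts_large, acts_small @ W_up + b_up)
print(f"explained variance, T_down: {ev_down:.3f}, T_up: {ev_up:.3f}")
```

In this toy setup the small activations are, by construction, an affine function of the large ones, so the downward map fits almost perfectly while the upward map cannot recover directions the small space does not contain. Real model pairs need not behave this way.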

Abstract

In this work, we demonstrate that affine mappings between residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a savings in FLOPs. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
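One natural way to read "transferring the weights of an SAE" is to fold the two affine maps into the SAE's encoder and decoder, since a composition of affine maps is again affine. The sketch below does this for a plain ReLU SAE with biases and checks numerically that the folded weights match explicit composition. The SAE architecture, the row-vector convention, and all names (`sae_forward`, `transfer_sae`, `W_down`, `W_up`, ...) are assumptions for illustration; the paper's exact transfer procedure may differ.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU SAE: returns (feature activations, reconstruction)."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    return f, f @ W_dec + b_dec

def transfer_sae(sae_a, W_down, b_down, W_up, b_up):
    """Fold affine stitches into an SAE trained on model A so it acts on model B.

    Encoder: x_B -> T_down(x_B) -> encode.  Decoder: decode -> T_up(.).
    """
    W_enc, b_enc, W_dec, b_dec = sae_a
    return (
        W_down @ W_enc,           # (d_b, M)
        b_down @ W_enc + b_enc,   # (M,)
        W_dec @ W_up,             # (M, d_b)
        b_dec @ W_up + b_up,      # (d_b,)
    )

# Random placeholders for an SAE on model A and a pair of stitches.
rng = np.random.default_rng(0)
d_a, d_b, M = 4, 6, 10
sae_a = (rng.normal(size=(d_a, M)), rng.normal(size=M),
         rng.normal(size=(M, d_a)), rng.normal(size=d_a))
W_down, b_down = rng.normal(size=(d_b, d_a)), rng.normal(size=d_a)
W_up, b_up = rng.normal(size=(d_a, d_b)), rng.normal(size=d_b)

sae_b = transfer_sae(sae_a, W_down, b_down, W_up, b_up)

# Check: the folded SAE equals stitch-down -> SAE_A -> stitch-up.
x_b = rng.normal(size=(5, d_b))
f_ref, xhat_a = sae_forward(x_b @ W_down + b_down, *sae_a)
xhat_ref = xhat_a @ W_up + b_up
f_new, xhat_new = sae_forward(x_b, *sae_b)
print(np.allclose(f_ref, f_new), np.allclose(xhat_ref, xhat_new))
```

Under these assumptions, `sae_b` would serve as the initialization for further SAE training on the larger model rather than as a finished SAE.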

Paper Structure

This paper contains 40 sections, 14 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overview of the main methodologies. (a) We train two affine mappings $\mathcal{T}_{\uparrow, \downarrow}$ concurrently to map between the residual streams of two language models $A$ and $B$. The mappings $\mathcal{T}_{\uparrow, \downarrow}$ are then used to transfer (b) the weights of entire SAEs from $A$ to $B$, which (c) give better initializations that save compute when training SAEs on $B$. The approximate "scaling law" for SAE training is shifted to the left when training from the transferred initialization, capturing the intuition that the transferred initialization saves the work of relearning shared features. More generally, the stitches can be used to transfer (d) arbitrary vectors (probes, steering vectors) between the residual stream spaces.
  • Figure 2: (a) In the Pythia model pair, transferred SAE initialization adjusted by the stitch FLOP count reaches explained variance thresholds in fewer FLOPs. For thresholds around $90\%$ explained variance, the moving average of explained variance of the SAE hits the threshold in around $30$-$50\%$ fewer FLOPs. (b) Features trained from stitched initialization have higher cosine similarity to their initial state than features trained from random initialization in the $M = 32768$ runs. Dead features are removed from consideration for clarity around $1.0$.
  • Figure 3: (a) Evaluations of transferred probes stitching from pythia-70m-deduped to pythia-160m-deduped averaged over $8$ binary classification datasets. If the probe is retrained (orange), we almost recover ground truth performance (blue) across all probing values of $k$ and most datasets. Even if the probe is not retrained, in most datasets we are able to probe significantly better than random (green). When directly probing on the residual stream, we find that transferring a probe trained on 70m-deduped (dotted) reaches similar accuracy to a probe trained on 160m-deduped (dashed) without retraining. (b) Response language steering vectors can be transferred between gemma-2-2b.20 and gemma-2-9b.33. From left to right, we chart the % of responses in the target language for no steering, ground truth steering, and steering using a transferred vector, averaged over all languages $L$ in EuroParl and over prompts. (c) The relative transfer gap distribution is bimodal with concentrations at $0$ and $1$, implying that transferred steering works well for some languages but poorly for others.
  • Figure 4: (a) An overview of the feature analysis pipeline for a simple example where $2$ augmentations are generated. Structural features activate on all prompts (intersection) whereas semantic features only activate on some but not all prompts. (b) The semantic/structural classification reveals a divergence in the attribution correlation transferability metric for non-dead features. We plot the densities of both categories separately and relative density above. Structural features transfer more consistently but semantic features are more polarized (dominate the upper and lower percentiles).
  • Figure 5: (a) Two entropy SAE features retain both a large max activation and a high composition with the effective null space (bottom $2\%$ of singular values) before and after transfer. For clarity we take a randomly sampled subset of $2000$ features. (b) Attention on the BOS token in the path patching experiment. After transfer, we are still able to find a head such that zero-ablating the contribution of the feature results in decreased attention on the BOS token. We only plot tokens on which the original feature activates in gpt2-small.
  • ...and 11 more figures
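
For the vector transfers sketched in Figure 1(d), the algebra of an affine map suggests simple rules, shown below as a minimal sketch rather than the authors' procedure. A linear probe read off model $A$ can be evaluated on model $B$ by first mapping $B$'s activations down, which folds into a new weight and bias; a steering vector added in $A$'s space shifts the mapped-up activation by the linear part of the stitch only. The names (`transfer_probe`, `transfer_steering`), the random stitches, and the assumption that $A$ is the smaller model are hypothetical.

```python
import numpy as np

def transfer_probe(w, c, W_down, b_down):
    """Probe logit on A is x_A @ w + c; with x_A ~= x_B @ W_down + b_down
    this becomes x_B @ (W_down @ w) + (b_down @ w + c)."""
    return W_down @ w, b_down @ w + c

def transfer_steering(v, W_up):
    """T_up(x_A + v) - T_up(x_A) = v @ W_up, so the bias drops out."""
    return v @ W_up

# Random placeholders for stitches, a probe, and a steering vector.
rng = np.random.default_rng(0)
d_a, d_b = 4, 6
W_down, b_down = rng.normal(size=(d_b, d_a)), rng.normal(size=d_a)
W_up, b_up = rng.normal(size=(d_a, d_b)), rng.normal(size=d_b)
w, c = rng.normal(size=d_a), 0.3   # linear probe on model A
v = rng.normal(size=d_a)           # steering vector in model A's space

w_b, c_b = transfer_probe(w, c, W_down, b_down)
v_b = transfer_steering(v, W_up)

# Check both identities on random activations.
x_b = rng.normal(size=(5, d_b))
x_a = rng.normal(size=(5, d_a))
logits_ref = (x_b @ W_down + b_down) @ w + c
shift_ref = ((x_a + v) @ W_up + b_up) - (x_a @ W_up + b_up)
print(np.allclose(logits_ref, x_b @ w_b + c_b), np.allclose(shift_ref, v_b))
```

These identities hold exactly for any affine stitch; how well the transferred probe or steering vector performs on a real model depends on how faithful the learned stitch is.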