Revisiting Model Stitching to Compare Neural Representations

Yamini Bansal, Preetum Nakkiran, Boaz Barak

TL;DR

This paper revisits model stitching as a practical tool for probing neural representations: a low-capacity, trainable layer is inserted between the bottom layers of one frozen network and the top layers of another. It demonstrates that well-trained networks that differ in initialization or training objective can be stitched with minimal loss in accuracy, supporting the idea that similar representations emerge across models. The authors also show that 'more is better': representations learned with additional data, wider networks, or longer training can be plugged into weaker models to improve performance. They further uncover the stitching connectivity phenomenon, where SGD minima are mutually stitchable. Compared with similarity metrics like CKA, stitching provides an interpretable, task-centered measure of representation compatibility and reveals asymmetries in learning. Overall, model stitching emerges as a powerful diagnostic that complements existing representational analyses and opens avenues for studying training dynamics and cross-domain representation transfer.

Abstract

We revisit and extend model stitching (Lenc & Vedaldi 2015) as a methodology to study the internal representations of neural networks. Given two trained and frozen models $A$ and $B$, we consider a "stitched model" formed by connecting the bottom layers of $A$ to the top layers of $B$, with a simple trainable layer between them. We argue that model stitching is a powerful and perhaps under-appreciated tool, which reveals aspects of representations that measures such as centered kernel alignment (CKA) cannot. Through extensive experiments, we use model stitching to obtain quantitative verifications for intuitive statements such as "good networks learn similar representations", by demonstrating that good networks of the same architecture, but trained in very different ways (e.g., supervised vs. self-supervised learning), can be stitched to each other without drop in performance. We also give evidence for the intuition that "more is better" by showing that representations learnt with (1) more data, (2) bigger width, or (3) more training time can be "plugged in" to weaker models to improve performance. Finally, our experiments reveal a new structural property of SGD which we call "stitching connectivity", akin to mode-connectivity: typical minima reached by SGD can all be stitched to each other with minimal change in accuracy.
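
The stitched-model construction described above can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration, not the authors' setup: `A_bottom` and `B_top` are random single-layer stand-ins for the frozen halves of two trained models, the stitching layer `S` is a plain linear map initialized to the identity, the labels come from a made-up linear teacher, and `S` is assumed to be trained by gradient descent on the cross-entropy task loss while both halves stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def cross_entropy(logits, y):
    # Numerically stable mean cross-entropy over integer labels y.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

# Hypothetical stand-ins for the frozen pieces (random here, trained in practice).
d_in, d_hid, n_cls = 8, 16, 3
A_bottom = rng.normal(size=(d_in, d_hid)) / np.sqrt(d_in)   # bottom layers of A
B_top = rng.normal(size=(d_hid, n_cls)) / np.sqrt(d_hid)    # top layers of B

def stitched_logits(x, S):
    h = relu(x @ A_bottom)      # frozen bottom of A
    return (h @ S) @ B_top      # trainable stitch S, then frozen top of B

def train_stitch(x, y, steps=300, lr=0.05):
    # Only S is updated; A_bottom and B_top are never modified.
    S = np.eye(d_hid)
    onehot = np.eye(n_cls)[y]
    h = relu(x @ A_bottom)
    for _ in range(steps):
        logits = (h @ S) @ B_top
        z = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        grad_logits = (p - onehot) / len(x)        # dLoss/dlogits
        S -= lr * (h.T @ grad_logits @ B_top.T)    # dLoss/dS via the chain rule
    return S

# Toy data: labels produced by a made-up linear teacher.
x = rng.normal(size=(256, d_in))
teacher = rng.normal(size=(d_in, n_cls))
y = np.argmax(x @ teacher, axis=1)

loss_before = cross_entropy(stitched_logits(x, np.eye(d_hid)), y)
S = train_stitch(x, y)
loss_after = cross_entropy(stitched_logits(x, S), y)
```

Keeping `S` linear reflects the requirement that the stitching layer be simple: if it were expressive enough to solve the task on its own, a low stitched error would say little about whether the two representations are compatible.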

Paper Structure

This paper contains 22 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Two extreme "cartoons" for training dynamics of neural networks. In the "snowflakes" scenario, there are exponentially many well-performing neural networks with highly diverging internals. In the "Anna Karenina" scenario, all well-performing networks end up learning similar representations, even if their initialization, architecture, data, and objectives differ. Image credits: li2018visualizing, olah2017feature, olah2020an, mclean21.
  • Figure 2: Summary of main results. (A) Various models trained on CIFAR-10 identically except with different random initializations are "stitching connected": they can be stitched at all layers with minimal performance drop (see Section \ref{sec:stitch-connect}). Stitching with a random bottom network is shown for reference. (B) Models of the same architecture and similar test error, but trained on ImageNet with end-to-end supervised learning versus self-supervised learning, can be stitched with good performance (see Section \ref{sec:all-roads}). (C) A better representation obtained by training the network with more samples can be "plugged in" with stitching to improve performance (see Section \ref{sec:more-better}). In all figures, stitching penalty is the difference in error between the stitched model and the base top model.
  • Figure 3: (A) Changing the label distribution: Representations trained on CIFAR-10 with the Object vs. Animals task or with {10%, 50%, 100%} label noise and stitched to a network trained on original CIFAR-10 labels. Early layers learn similar representations even when the label distribution is "less informative" than training with all labels. (B) Increasing training time: Representations at different epochs during training are stitching compatible, and early layers converge faster. (C) Increasing width: Better representations from a wider network can be stitched with a thinner network to improve performance. All experiments were performed with ResNet-18 on CIFAR-10.
  • Figure 4: Test Error with changing kernel size
  • Figure 5: Comparing the representation of a random network with a trained network with model stitching and CKA
  • ...and 4 more figures
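
The stitching penalty defined in the Figure 2 caption (error of the stitched model minus error of the base top model) can be written out as a small sketch. The helper names `error_rate` and `stitching_penalty` and the toy labels and predictions below are made up for illustration; they are not results from the paper.

```python
def error_rate(preds, labels):
    # Fraction of examples where the prediction differs from the label.
    wrong = sum(p != t for p, t in zip(preds, labels))
    return wrong / len(labels)

def stitching_penalty(stitched_preds, base_top_preds, labels):
    # Difference in error between the stitched model and the base top model.
    return error_rate(stitched_preds, labels) - error_rate(base_top_preds, labels)

# Made-up toy predictions on eight examples.
labels         = [0, 1, 2, 1, 0, 2, 1, 0]
base_top_preds = [0, 1, 2, 1, 0, 2, 1, 1]   # 1 mistake  -> error 0.125
stitched_preds = [0, 1, 2, 0, 0, 2, 1, 1]   # 2 mistakes -> error 0.25

penalty = stitching_penalty(stitched_preds, base_top_preds, labels)
print(penalty)  # 0.125
```

Under this definition a penalty near zero means the stitched model performs about as well as the top model's own network, and a negative penalty means the plugged-in bottom representation improved on it.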

Theorems & Definitions (2)

  • Definition 1: Stitching Connectivity
  • Conjecture 2: Stitching Connectivity of SGD, informal