Revisiting Model Stitching to Compare Neural Representations
Yamini Bansal, Preetum Nakkiran, Boaz Barak
TL;DR
This paper revisits model stitching as a practical tool for probing neural representations: the bottom layers of one frozen network are connected to the top layers of another through a low-capacity, trainable layer. It demonstrates that well-trained networks with different initializations, or trained under different objectives, can be stitched with minimal loss in accuracy, supporting the idea that similar representations emerge across models. The authors also show that "more is better": representations learned with additional data, wider networks, or longer training can be plugged into weaker models to improve performance. They further uncover the stitching connectivity phenomenon, in which the minima typically reached by SGD can be stitched to one another. Compared with similarity metrics such as CKA, stitching provides an interpretable, task-centered measure of representation compatibility and reveals asymmetries in learning. Overall, model stitching emerges as a powerful diagnostic that complements existing representational analyses and opens avenues for studying training dynamics and cross-domain representation transfer.
Abstract
We revisit and extend model stitching (Lenc & Vedaldi 2015) as a methodology to study the internal representations of neural networks. Given two trained and frozen models $A$ and $B$, we consider a "stitched model" formed by connecting the bottom layers of $A$ to the top layers of $B$, with a simple trainable layer between them. We argue that model stitching is a powerful and perhaps under-appreciated tool, which reveals aspects of representations that measures such as centered kernel alignment (CKA) cannot. Through extensive experiments, we use model stitching to obtain quantitative verifications for intuitive statements such as "good networks learn similar representations", by demonstrating that good networks of the same architecture, but trained in very different ways (e.g., supervised vs. self-supervised learning), can be stitched to each other without a drop in performance. We also give evidence for the intuition that "more is better" by showing that representations learnt with (1) more data, (2) bigger width, or (3) more training time can be "plugged in" to weaker models to improve performance. Finally, our experiments reveal a new structural property of SGD which we call "stitching connectivity", akin to mode-connectivity: typical minima reached by SGD can all be stitched to each other with minimal change in accuracy.
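The stitching construction described above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' experimental setup: the two-hidden-layer ReLU networks, the synthetic linearly separable task, the linear stitching matrix `S`, and every name and hyperparameter below are hypothetical choices made for illustration. Two models $A$ and $B$ are trained from different random initializations and then frozen; only `S`, placed between the bottom layer of $A$ and the top layers of $B$, is trained on the task loss.

```python
import numpy as np

def forward(p, X):
    """Return bottom features, stitched features, top hidden layer, and logits."""
    h1 = np.maximum(X @ p["W1"], 0.0)   # bottom half
    s = h1 @ p["S"]                      # stitching layer (identity in a plain model)
    h2 = np.maximum(s @ p["W2"], 0.0)   # top half, hidden layer
    return h1, s, h2, h2 @ p["v"]        # top half, linear readout

def grads(p, X, y):
    """Gradients of the mean logistic loss with respect to every parameter."""
    h1, s, h2, logits = forward(p, X)
    prob = 1.0 / (1.0 + np.exp(-np.clip(logits, -30.0, 30.0)))  # clip avoids overflow
    dlogits = (prob - y) / len(y)
    dh2 = np.outer(dlogits, p["v"]) * (h2 > 0)
    ds = dh2 @ p["W2"].T
    dh1 = (ds @ p["S"].T) * (h1 > 0)
    return {"W1": X.T @ dh1, "S": h1.T @ ds, "W2": s.T @ dh2, "v": h2.T @ dlogits}

def train(p, X, y, trainable, steps=1500, lr=0.1):
    """Full-batch gradient descent that updates only the keys in `trainable`."""
    for _ in range(steps):
        g = grads(p, X, y)
        for k in trainable:
            p[k] = p[k] - lr * g[k]      # rebinding leaves shared frozen arrays untouched
    return p

def accuracy(p, X, y):
    return float(np.mean((forward(p, X)[3] > 0) == (y > 0.5)))

def init(rng, d, h):
    return {"W1": rng.normal(size=(d, h)) / np.sqrt(d),
            "S": np.eye(h),              # identity: no stitching in a plain model
            "W2": rng.normal(size=(h, h)) / np.sqrt(h),
            "v": rng.normal(size=h) / np.sqrt(h)}

# Hypothetical toy task: labels given by a random linear rule in d dimensions.
rng = np.random.default_rng(0)
d, h = 4, 16
w_true = rng.normal(size=d)
X_train, X_test = rng.normal(size=(400, d)), rng.normal(size=(400, d))
y_train = (X_train @ w_true > 0).astype(float)
y_test = (X_test @ w_true > 0).astype(float)

# Train A and B from different initializations, then treat them as frozen.
A = train(init(rng, d, h), X_train, y_train, trainable=("W1", "W2", "v"))
B = train(init(rng, d, h), X_train, y_train, trainable=("W1", "W2", "v"))

# Stitched model: bottom of A, top of B, and a trainable stitching matrix S.
stitched = {"W1": A["W1"], "S": np.eye(h), "W2": B["W2"], "v": B["v"]}
stitched = train(stitched, X_train, y_train, trainable=("S",))

acc_A = accuracy(A, X_test, y_test)
acc_B = accuracy(B, X_test, y_test)
acc_stitched = accuracy(stitched, X_test, y_test)
print(f"A: {acc_A:.3f}  B: {acc_B:.3f}  stitched (A bottom -> B top): {acc_stitched:.3f}")
```

In this sketch, the quantity of interest is the gap between `acc_B` and `acc_stitched`: a small gap suggests that the top of $B$ can consume the representation from the bottom of $A$ after only a linear re-mapping. Restricting the stitching layer to a single linear map is one way to keep it low-capacity, so that it cannot learn the task on its own.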
