Revisiting Model Stitching In the Foundation Model Era

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Abstract

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
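
The stitching setup and the two feature-matching objectives contrasted in the abstract can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy "models" are stacks of random linear maps followed by tanh, the stitch layer is a single linear map, both losses are mean squared errors, and the names `make_blocks`, `stitched_forward`, `layer_matching_loss`, and `final_matching_loss` are hypothetical. The toy matches the output of the last block as a stand-in for the target model's penultimate-layer features, and no training loop is shown; in practice the stitch layer would be optimized by gradient descent while both models stay frozen.

```python
import math
import random


def make_blocks(depth, dim, seed):
    """Toy frozen 'model': a list of random dim x dim weight matrices."""
    rnd = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[[rnd.gauss(0.0, 1.0) * scale for _ in range(dim)]
             for _ in range(dim)] for _ in range(depth)]


def linear(w, x):
    """Apply a weight matrix (rows index outputs) to a feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def run_blocks(blocks, x):
    """Run x through a sequence of toy blocks (linear map, then tanh)."""
    for w in blocks:
        x = [math.tanh(v) for v in linear(w, x)]
    return x


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def stitched_forward(source, target, stitch, x, k):
    """Source blocks up to stitch point k, stitch layer, then remaining target blocks."""
    h = run_blocks(source[:k], x)
    h = linear(stitch, h)
    return run_blocks(target[k:], h)


def layer_matching_loss(source, target, stitch, x, k):
    """Match stitched features to the target's own features at the stitch point."""
    stitched = linear(stitch, run_blocks(source[:k], x))
    reference = run_blocks(target[:k], x)
    return mse(stitched, reference)


def final_matching_loss(source, target, stitch, x, k):
    """Match the stitched model's late features to the unmodified target's."""
    return mse(stitched_forward(source, target, stitch, x, k),
               run_blocks(target, x))


# Hypothetical toy configuration (assumed values, not from the paper).
DIM, DEPTH, K = 4, 6, 2
source = make_blocks(DEPTH, DIM, seed=0)
target = make_blocks(DEPTH, DIM, seed=1)
identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
x = [0.5, -1.0, 0.25, 2.0]

# Self-stitch with an identity stitch layer reproduces the target exactly.
self_loss = final_matching_loss(target, target, identity, x, K)
# Heterogeneous stitch with an untrained (identity) stitch layer.
cross_layer = layer_matching_loss(source, target, identity, x, K)
cross_final = final_matching_loss(source, target, identity, x, K)
```

The two loss functions differ only in where the comparison is made: `layer_matching_loss` stops at the stitch point, whereas `final_matching_loss` propagates the stitched features through the remaining target blocks before comparing, so any mismatch that the later layers amplify is visible to the objective.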

Paper Structure (45 sections, 5 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Model Stitching Training Strategies: (a) Layer Feature Matching trains the stitch layer to match features between the source and target models at the stitch point. (b) Final Feature Matching trains the stitch layer to match the final output features. (c) Task Loss Training optimizes the downstream objective directly.
  • Figure 2: Layer Feature Matching vs. Final Feature Matching distance on SigLIP→DINOv2. Layer matching achieves low layer feature distance but high final feature distance, while final matching maintains low final feature distance.
  • Figure 3: Final Feature Matching consistently shows better accuracy than Layer Feature Matching. In the DINOv2→SigLIP2 case, the stitched model can even exceed the performance of both constituent models.
  • Figure 4: Our two-stage training approach (Final Feature Matching followed by Task Loss Training) allows stitched models to consistently outperform linear probing of both constituent models.
  • Figure 5: Stitched Model vs Self-Stitch. Both DINOv2→SigLIP and SigLIP→DINOv2 (solid lines) consistently outperform their respective self-stitch baselines (dashed lines), demonstrating genuine knowledge fusion gains of +2.3% to +2.6% at optimal layers.
  • ...and 5 more figures