SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Haoxiang Wang; Pavan Kumar Anasosalu Vasu; Fartash Faghri; Raviteja Vemulapalli; Mehrdad Farajtabar; Sachin Mehta; Mohammad Rastegari; Oncel Tuzel; Hadi Pouransari

TL;DR

The paper addresses the inefficiency of deploying separate vision foundation models by proposing SAM-CLIP, a unified backbone that combines SAM's spatial understanding for segmentation with CLIP's semantic understanding. It frames merging as a rehearsal-based continual learning problem and uses a two-stage distillation approach with memory replay to limit forgetting. Empirical results show that SAM-CLIP retains most of the zero-shot capabilities of its parent models, achieves state-of-the-art zero-shot semantic segmentation on five benchmarks, and learns richer representations that improve downstream tasks. The result is a single promptable model that performs classification, instance segmentation, and semantic segmentation, with lower storage and compute costs than running SAM and CLIP separately, which suits resource-constrained edge devices.

Abstract

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
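The rehearsal idea behind the merge can be illustrated with a small sketch: the student is distilled toward the CLIP teacher on CLIP-style data while a replayed batch of SAM-style data keeps it close to the SAM teacher. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the use of cosine distance for both terms, and the `replay_weight` parameter are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distill_loss(student: List[Vector], teacher: List[Vector]) -> float:
    """Average cosine distance between student and teacher outputs over a batch."""
    return sum(cosine_distance(s, t) for s, t in zip(student, teacher)) / len(student)


def merged_loss(
    clip_student: List[Vector],
    clip_teacher: List[Vector],
    sam_student: List[Vector],
    sam_teacher: List[Vector],
    replay_weight: float = 1.0,  # hypothetical knob balancing the two terms
) -> float:
    """Combine the new-task term (CLIP) with the rehearsal term (SAM)."""
    new_task = mean_distill_loss(clip_student, clip_teacher)
    rehearsal = mean_distill_loss(sam_student, sam_teacher)
    return new_task + replay_weight * rehearsal
```

Calling `merged_loss` once per training step with one batch from each data source keeps both teachers in the objective, so lowering the CLIP term cannot come at unlimited cost to the SAM term.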

Paper Structure (27 sections, 2 equations, 8 figures, 7 tables)

Introduction
Background
Proposed Approach
Experiments
Implementation Details
Zero-Shot Evaluations
Head-Probing Evaluations on Learned Representations
Composing Both CLIP and SAM Heads for Better Segmentation
Conclusion
More Experimental Details
Software
Hardware
CLIP Head Structure
Hyperparameters
Multi-Task Distillation
...and 12 more sections

Figures (8)

Figure 1: SAM-CLIP inherits most zero-shot capabilities of SAM (instance segmentation) and CLIP (classification) using a single shared backbone (left). Further, SAM-CLIP is capable of a new task, zero-shot semantic segmentation, and obtains state-of-the-art results on several benchmarks, with a large margin compared to previous models specifically designed for this task (right). Detailed results are provided in Tables \ref{tab:zeroshot} and \ref{tab:semantic-seg}.
Figure 2: Multi-head architecture of SAM-CLIP. Left: the training pipeline, where we perform multi-task distillation from CLIP and SAM teacher models on the $\mathcal{D}_{\texttt{CLIP}}$ and $\mathcal{D}_{\texttt{SAM}}$ datasets, respectively. Right: the inference pipeline, where a single backbone performs multiple promptable tasks: classification, instance segmentation, and semantic segmentation. $\odot$ denotes the inner product between the text embedding and the image patch embeddings.
Figure 3: Demo on zero-shot semantic segmentation. (a), (c) Passing an input image through the image encoder, $\mathrm{Head}_{\texttt{CLIP}}$ can predict a semantic segmentation mask (quantitative results provided in Table \ref{tab:semantic-seg}). (d) One can further refine it by passing the mask output of $\mathrm{Head}_{\texttt{CLIP}}$ and auto-generated point prompts to $\mathrm{Head}_{\texttt{SAM}}$ to generate a more fine-grained semantic mask (quantitative results shown in Table \ref{tab:sam-head-refine}).
Figure 4: Representation learning comparison. Head-probing evaluation of each vision backbone for classification and semantic segmentation tasks. The results show that SAM-CLIP learns richer visual features compared to SAM and CLIP.
Figure 5: Comparison of instance segmentation between SAM and SAM-CLIP. The same images, along with geometric prompts (bounding box and point), are provided to both SAM and SAM-CLIP, and their respective model outputs are displayed above. While the outputs of SAM and SAM-CLIP exhibit slight differences, they are overall quite similar.
...and 3 more figures
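The inner product in the Figure 2 caption, taken between a text embedding and each image patch embedding, can be sketched as a per-patch scoring step. This is a minimal illustration under stated assumptions: the function name `segment_patches`, the nested-list layout of the patch grid, and the argmax rule for picking a class are hypothetical choices, not details confirmed by the captions.

```python
from typing import List, Sequence


def segment_patches(
    patch_embeddings: List[List[Sequence[float]]],
    text_embeddings: List[Sequence[float]],
) -> List[List[int]]:
    """Label each patch with the index of the best-matching class text embedding.

    patch_embeddings is a rows x cols grid of vectors; text_embeddings holds
    one vector per class name. The score is a plain inner product.
    """
    mask: List[List[int]] = []
    for row in patch_embeddings:
        mask_row: List[int] = []
        for patch in row:
            scores = [sum(p * t for p, t in zip(patch, text)) for text in text_embeddings]
            # Assumption: the highest-scoring class wins for each patch.
            mask_row.append(max(range(len(scores)), key=scores.__getitem__))
        mask.append(mask_row)
    return mask
```

The resulting grid is a coarse, patch-resolution label map; in the pipeline described by the Figure 3 caption, a mask of this kind is what would be handed on for refinement.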
