There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

TL;DR

This study suggests that incorporating external semantic sources is a promising way to improve SAM's utility for complex visual tasks that require semantic understanding.

Abstract

The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding that is of value to broader visual tasks? In this work we follow a multi-stage approach to exploring this question. We first quantify SAM's semantic capabilities by comparing the efficacy of its base image encoder on classification tasks against established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, which limits its potential for tasks that require class differentiation. This initial result motivates an exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning, where we observe that generalisability to unseen classes remains limited. These observations lead us to propose a training-free approach that leverages DINOv2 features to better endow SAM with semantic understanding and to achieve instance-level class differentiation through feature-based similarity. Our study suggests that incorporating external semantic sources is a promising direction for enhancing SAM's utility for complex visual tasks that require semantic understanding.
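
The abstract's "instance-level class differentiation through feature-based similarity" can be illustrated with a minimal sketch. It assumes that class-agnostic masks (such as SAM would produce) and a dense semantic feature map (such as DINOv2 would produce) are already available as arrays. The function names, the mean-pooling step, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def masked_mean_feature(feature_map, mask):
    """Average an (H, W, D) feature map over the pixels where mask is True."""
    if not mask.any():
        raise ValueError("mask selects no pixels")
    return feature_map[mask].mean(axis=0)

def l2_normalise(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def classify_masks(feature_map, masks, prototypes, class_names):
    """Give each mask the class whose prototype is most cosine-similar."""
    region_feats = np.stack([masked_mean_feature(feature_map, m) for m in masks])
    sims = l2_normalise(region_feats) @ l2_normalise(prototypes).T
    best = sims.argmax(axis=1)
    return [(class_names[i], float(sims[row, i])) for row, i in enumerate(best)]

# Toy stand-ins (assumptions): random vectors play the role of semantic
# features, and two half-image masks play the role of segmentation masks.
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
cat_dir, dog_dir = rng.normal(size=D), rng.normal(size=D)
feature_map = 0.05 * rng.normal(size=(H, W, D))
feature_map[:, :4] += cat_dir   # left half carries "cat"-like features
feature_map[:, 4:] += dog_dir   # right half carries "dog"-like features

left = np.zeros((H, W), dtype=bool)
left[:, :4] = True
right = ~left

# Hypothetical class prototypes, e.g. pooled from reference images.
prototypes = np.stack([cat_dir, dog_dir])
predictions = classify_masks(feature_map, [left, right], prototypes, ["cat", "dog"])
print(predictions)
```

In this sketch the segmentation model only decides where the regions are, while the external features decide what each region is, which is the division of labour the abstract argues for.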

Paper Structure

This paper contains 14 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Exploring SAM's Semantic Gap for Image Understanding. (1) Quantifying SAM’s Semantic Understanding: Despite training on a very large dataset, SAM lacks inherent semantics, as shown by its lower ImageNet1K classification accuracy compared to CLIP and DINOv2 models. (2) Recovering Semantics with Fine-tuning: SAM's ability to generalise remains limited; it can identify classes in the training set but struggles with unseen classes even with in-context learning through DETR. (3) Injecting Semantics from External Models: By integrating semantic-rich representations from models like DINOv2, we can enhance SAM's ability to match semantics and improve its understanding of segmented regions.
  • Figure 2: End-to-end training pipeline for in-context segmentation prompting. First, a reference image is encoded with SAM and combined with its corresponding category annotation to obtain token embeddings. Second, we encode the target image and condition the DETR decoder head on the reference token embeddings to generate box proposals. To obtain the final masks, we use the predicted boxes to condition SAM to generate mask predictions. The SAM encoder and decoder are completely frozen and reused; only the DETR head and the lightweight token-merge MLP layers are trained.
  • Figure 3: Qualitative visualisations on COCO images. Reference images are used to condition our model for in-context semantic prompting. We run our model per reference image and aggregate results in the visualisation for the target image. Best viewed when zoomed in. By incorporating a DETR decoder, we natively support multi-instance detection for a given category.
  • Figure 4: Failure cases for NOVEL classes (unseen during training). The lack of generalisability of our adapted SAM is apparent on unseen semantic categories. We provide the list of categories for both BASE and NOVEL splits for reference. We highlight some base and novel classes to draw the attention of the reader to the corresponding failure cases in the images. In the top example, skateboards from the BASE set are accurately segmented, but persons from the NOVEL set are not. Middle: mouse and keyboard instances from BASE are identified but person and tv from NOVEL are not. Bottom: Trucks from BASE are identified, but car, motorcycle and person from NOVEL are not.
  • Figure 5: t-SNE visualisation of the COCO-trained model features. We refer the reader to the figure labelled fig:sam-incole for the locations (red and blue stars) where the feature representations are extracted for t-SNE analysis. Panels (a), (c), and (d) show the t-SNE plots of the pretrained frozen representations of SAM, DINOv2, and CLIP respectively. DINOv2 and CLIP show visibly better class separation (and thus more discriminative features) than SAM. Panels (a) and (b) compare the representations before and after the token-merge MLP, highlighting that semantics are learned only after fine-tuning and are therefore limited to the training categories.
  • ...and 1 more figure
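
The comparison in Figure 1, panel (1), scores frozen encoder features by classification accuracy. One common way to run such a comparison is a linear probe: fit only a linear classifier on top of fixed features and report held-out accuracy. The sketch below assumes that protocol, since this summary does not state the exact one used. It uses a closed-form ridge-regression probe and synthetic features in place of real SAM, CLIP, or DINOv2 outputs, and all names are hypothetical.

```python
import numpy as np

def fit_linear_probe(feats, labels, num_classes, ridge=1e-3):
    """Closed-form ridge regression onto one-hot targets; returns (D+1, C) weights."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    Y = np.eye(num_classes)[labels]                   # one-hot targets
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def probe_accuracy(weights, feats, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return float(((X @ weights).argmax(axis=1) == labels).mean())

def synthetic_features(rng, labels, centres, noise):
    """Stand-in for frozen encoder outputs: class centre plus Gaussian noise."""
    return centres[labels] + noise * rng.normal(size=(len(labels), centres.shape[1]))

rng = np.random.default_rng(0)
num_classes, dim = 5, 32
centres = rng.normal(size=(num_classes, dim))
train_y = rng.integers(0, num_classes, size=500)
test_y = rng.integers(0, num_classes, size=200)

# Two hypothetical encoders: one whose features separate classes well,
# and one whose class signal is buried in noise.
results = {}
for name, noise in [("well_separated", 0.5), ("entangled", 20.0)]:
    train_x = synthetic_features(rng, train_y, centres, noise)
    test_x = synthetic_features(rng, test_y, centres, noise)
    weights = fit_linear_probe(train_x, train_y, num_classes)
    results[name] = probe_accuracy(weights, test_x, test_y)
print(results)
```

Because the encoder is never updated, any gap in probe accuracy reflects how linearly separable the classes already are in the frozen features, which is the property the t-SNE plots in Figure 5 show qualitatively.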