No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

TL;DR

This work tackles the scarcity of annotated segmentation data by introducing a training-free, reference-based instance segmentation method that leverages strong semantic priors from foundation models. A three-stage pipeline constructs a memory bank from reference images, aggregates instance features into robust class prototypes, and performs cosine-similarity matching with semantic-aware merging on SAM-generated masks. The approach achieves state-of-the-art results on COCO-FSOD and PASCAL-VOC FSOD, and generalizes across domains on CD-FSOD without any fine-tuning, while remaining practically efficient. The findings demonstrate that carefully engineered use of frozen models can deliver high-quality instance segmentation across diverse domains, with potential for broader semantic mapping beyond instances.

Abstract

The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex, domain-dependent prompt-generation rules to process a new image. To reduce this new burden, our work investigates object segmentation when only a small set of reference images is provided instead. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that these correspondences enable automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
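The three stages named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes patch features have already been extracted into an `(H, W, D)` array and that reference and candidate masks are non-empty boolean `(H, W)` arrays at the same resolution. The function names (`instance_embedding`, `build_memory_bank`, `classify_masks`) are hypothetical, and the semantic-aware merging step is omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x) + eps)

def instance_embedding(features, mask):
    """Average the patch features inside a non-empty boolean mask.
    features: (H, W, D) array, mask: (H, W) bool array -> (D,) vector."""
    return features[mask].mean(axis=0)

def build_memory_bank(references):
    """Stages 1 and 2 (illustrative): collect one embedding per reference
    instance, then average per class into a unit-norm prototype.
    references: iterable of (features, mask, class_id) tuples."""
    per_class = {}
    for features, mask, class_id in references:
        per_class.setdefault(class_id, []).append(instance_embedding(features, mask))
    return {c: l2_normalize(np.mean(embs, axis=0)) for c, embs in per_class.items()}

def classify_masks(features, candidate_masks, memory_bank):
    """Stage 3 (illustrative): label each candidate mask with the class whose
    prototype has the highest cosine similarity to the mask embedding."""
    class_ids = list(memory_bank)
    prototypes = np.stack([memory_bank[c] for c in class_ids])  # (C, D)
    results = []
    for mask in candidate_masks:
        emb = l2_normalize(instance_embedding(features, mask))
        sims = prototypes @ emb  # cosine similarity, since both sides are unit-norm
        best = int(np.argmax(sims))
        results.append((class_ids[best], float(sims[best])))
    return results
```

In the paper's setting the features would come from a frozen semantic backbone and the candidate masks from SAM; here both are plain arrays so the matching logic can be read in isolation.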

Paper Structure

This paper contains 29 sections, 30 figures, 9 tables.

Figures (30)

  • Figure 1: Cross-domain 1-shot segmentation results using our training-free method on CD-FSOD benchmark. Our method directly evaluates on diverse datasets without any fine-tuning, using frozen SAMv2 and DINOv2 models. The reference set contains a single example image per class. The model then segments the entire target dataset based on the reference set. Results show: (1) generalization capabilities to out-of-distribution domains (e.g., underwater images, cartoons, microscopic textures); (2) state-of-the-art performance in 1-shot segmentation without training or domain adaptation; (3) limitations in cases with ambiguous annotations or highly similar classes (e.g., "harbor" vs. "ships" in DIOR). Best viewed when zoomed in. Ablation studies further investigate the variance associated with the selection of reference images. See Appendix \ref{app:cdfsod} for more visualisations.
  • Figure 2: Overview of our training-free method for few-shot instance segmentation and object detection. (1) Reference Memory Creation: A segmented reference image is processed using the DINOv2 model to generate semantic feature embeddings. (2) Feature Aggregation: We compute instance-wise feature representations and then aggregate them into class-wise prototypes, which are stored in the memory bank. (3) Inference on Target Dataset: For each target image, SAMv2 generates instance segmentation masks while DINOv2 extracts semantic features. Using cosine similarity, each mask's embedding is compared with the reference memory bank to assign the most similar class label. Finally, predictions are aggregated via semantic-aware soft merging to produce the final annotated image. This pipeline enables semantic prompting via reference images, without requiring fine-tuning, and demonstrates state-of-the-art performance on established benchmarks (COCO-FSOD, PASCAL-FSOD) and strong generalization across domains (CD-FSOD).
  • Figure 3: Qualitative results on the COCO val2017 test set under the 10-shot setting (using 10 reference images per class). Bounding box visualisations are thresholded at 0.5. Our method effectively handles multiple overlapping instances in crowded scenes, demonstrating fine-grained semantics and precise localisation. Through semantic-aware soft merging, we avoid duplicate detections and false positives. Best viewed when zoomed in.
  • Figure 4: Feature comparison across semantic backbones. Left: full image and cropped region of interest. Right: feature PCA maps across backbones. CLIP features are low-resolution and irregular, PE-Spatial is noisy but informative, and DINOv2/DINOv3 are spatially consistent and structured. More visualisations in Figure \ref{fig:feature-comparison-large}.
  • Figure 5: Single vs. aggregated feature similarity. We compare cosine similarity maps obtained from (top) a single DINOv2 patch feature selected inside the reference mask (marked with a black +) and (bottom) the aggregated prototype obtained by averaging all features within the mask. For each case, we show intra-class similarity (within the same image) and inter-class similarity (with a target image). Single-feature similarity highlights only local object parts, whereas aggregated features produce more coherent, object-level similarity patterns.
  • ...and 25 more figures
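
The single-versus-aggregated comparison described in the Figure 5 caption can be reproduced in miniature. The sketch below assumes patch features are available as an `(H, W, D)` NumPy array and the reference mask is a non-empty boolean `(H, W)` array. The helper names `cosine_similarity_map` and `single_vs_aggregated` are hypothetical and not taken from the paper's code.

```python
import numpy as np

def cosine_similarity_map(features, query, eps=1e-8):
    """Cosine similarity between one query vector and every patch feature.
    features: (H, W, D), query: (D,) -> (H, W) map."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    q = query / (np.linalg.norm(query) + eps)
    return f @ q

def single_vs_aggregated(features, mask, point):
    """Return two similarity maps over the same feature grid:
    one queried with the single patch at `point` (row, col),
    one queried with the mean of all patch features inside `mask`."""
    single_query = features[point]                # (D,) one patch feature
    aggregated_query = features[mask].mean(axis=0)  # (D,) mask-averaged prototype
    return (cosine_similarity_map(features, single_query),
            cosine_similarity_map(features, aggregated_query))
```

On a toy object made of two parts with orthogonal features, the single-patch query responds only to its own part, while the averaged query responds evenly to both parts, which mirrors the qualitative behaviour the caption reports.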