Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation

Hussni Mohd Zakir; Eric Tatt Wei Ho

TL;DR

This work examines whether frozen DINOv3 features can support effective few-shot semantic segmentation without training. It introduces FSSDINO, a training-free prototype-based method with Gram-matrix refinement that competes with more complex decoders and test-time adaptations. An Oracle-based layer analysis reveals substantial semantic potential in intermediate layers, exposing a Semantic Selection Gap where current heuristics fail to reliably identify high-fidelity features. The results show that the last-layer baseline is a strong, reliable default, while bridging the Semantic Selection Gap could unlock latent representations for robust cross-domain dense predictions.

Abstract

Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
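
To make the idea of prototype-based segmentation on frozen features concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes patch features have already been extracted from a frozen backbone and flattened into a list of vectors. The function names, the use of a background prototype taken from the inverted support mask, and in particular the form of `gram_refine` (smoothing scores with non-negative patch-to-patch cosine affinities) are illustrative assumptions; the abstract does not specify the exact Gram-matrix refinement used in FSSDINO.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def masked_average_prototype(features, mask):
    """Average the support patch features selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine_scores(features, prototype):
    """Cosine similarity between each query patch and a class prototype."""
    p = l2_normalize(prototype)
    return [sum(a * b for a, b in zip(l2_normalize(f), p)) for f in features]

def gram_refine(features, scores):
    """ASSUMED refinement: smooth scores with the query Gram matrix.

    Each patch's score becomes an affinity-weighted average of all patch
    scores, with affinities given by clipped cosine similarity.
    """
    normed = [l2_normalize(f) for f in features]
    refined = []
    for i, fi in enumerate(normed):
        weights = [max(0.0, sum(a * b for a, b in zip(fi, fj))) for fj in normed]
        total = sum(weights)
        if total == 0:
            refined.append(scores[i])  # degenerate patch: keep raw score
        else:
            refined.append(sum(w * s for w, s in zip(weights, scores)) / total)
    return refined

def segment(query_features, fg_proto, bg_proto):
    """Label each query patch 1 (foreground) or 0 (background)."""
    fg = gram_refine(query_features, cosine_scores(query_features, fg_proto))
    bg = gram_refine(query_features, cosine_scores(query_features, bg_proto))
    return [1 if f > b else 0 for f, b in zip(fg, bg)]

# Toy 2-D "features": two foreground-like and two background-like support patches.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_mask = [1, 1, 0, 0]
fg_proto = masked_average_prototype(support, support_mask)
bg_proto = masked_average_prototype(support, [1 - m for m in support_mask])

query = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.0]]
labels = segment(query, fg_proto, bg_proto)
```

In this toy case the first and third query patches lie close to the foreground prototype and the second lies close to the background prototype. Real patch features would be high-dimensional, and a vectorized array library would replace the explicit loops.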

Paper Structure (35 sections, 20 equations, 3 figures, 7 tables)

Introduction
Related Work
Method
Class Prototype Construction
Prototype-Based Similarity
Gram-Based Refinement
Region Class Assignment
Layer-wise Oracle and Heuristic Analysis
Oracle Performance and Baseline
Layer Quality Heuristics
Fisher Discriminant Score ($\mathcal{F}$):
Reverse mIoU ($\mathcal{M}_{rev}$):
Support Self-IoU ($\mathcal{M}_{self}$):
Gram Consistency ($\mathcal{G}$):
Register-to-Patch Energy Ratio ($\mathcal{R}$):
...and 20 more sections

Figures (3)

Figure 1: An overview of FSSDINO
Figure 2: Layer-wise feature selection performance under the 1-shot setting. Each curve corresponds to a dataset. Deeper transformer blocks consistently yield higher mIoU across datasets on average, indicating that last-layer features provide a more reliable selection signal in low-shot scenarios.
Figure 3: Qualitative segmentation results of FSSDINO on DeepGlobe, ISIC, and SUIM datasets, and MCDINO on COCO-$20^i$.
