Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Yuheng Shi, Minjing Dong, Chang Xu

TL;DR

Trident is a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then uses SAM's encoder to build a correlation matrix for global aggregation, broadening the receptive field for effective segmentation.

Abstract

While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. Previous adaptations addressed the spatial invariance by modifying the self-attention in CLIP's image encoder, but the issue of limited resolution remains unexplored. Unlike previous segment-then-splice methods, which segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment-Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs that transforms them into prompts for SAM, further enhancing segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, raising it from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
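The splice-then-segment idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names `splice`, `aggregate`, and `classify` are hypothetical, the tiles are assumed to be non-overlapping, the correlation matrix is assumed to be a temperature-scaled softmax over cosine similarities, and small hand-written vectors stand in for CLIP patch features, SAM encoder features, and text embeddings.

```python
import math

def splice(tiles, tiles_per_row):
    """Stitch per-tile patch grids into one global grid (assumes non-overlapping tiles)."""
    rows = []
    for start in range(0, len(tiles), tiles_per_row):
        row_tiles = tiles[start:start + tiles_per_row]
        for r in range(len(row_tiles[0])):
            rows.append([vec for tile in row_tiles for vec in tile[r]])
    return rows

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(values, corr_feats, temperature=0.1):
    """Replace each patch feature with a correlation-weighted average over ALL patches."""
    dim = len(values[0])
    out = []
    for q in corr_feats:
        # One row of the (assumed) correlation matrix: similarity of this patch to every patch.
        weights = softmax([cosine(q, k) / temperature for k in corr_feats])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out

def classify(patch_feats, text_embeds):
    """Assign each patch the class whose text embedding is most similar."""
    return [max(range(len(text_embeds)), key=lambda c: cosine(f, text_embeds[c]))
            for f in patch_feats]

# Two sub-images, each a 1x2 grid of stand-in "CLIP" patch features.
tiles = [
    [[[1.0, 0.0], [0.4, 0.6]]],   # second patch is noisy
    [[[0.0, 1.0], [0.0, 1.0]]],
]
spliced = splice(tiles, tiles_per_row=2)          # global 1x4 grid
values = [vec for row in spliced for vec in row]  # flatten to a patch list

# Stand-in "SAM" features computed on the whole image: patches 0-1 and 2-3 belong together.
corr_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]            # two hypothetical class embeddings

raw_labels = classify(values, text_embeds)
labels = classify(aggregate(values, corr_feats), text_embeds)
print(raw_labels, labels)
```

In this toy setup the noisy second patch is misclassified when each patch is scored on its own, and is corrected once features are aggregated across the sub-image boundary using the whole-image correlations, which is the intuition behind splicing before segmenting.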

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Comparison with previous SOTA performance on open-vocabulary semantic segmentation under the training-free setting.
  • Figure 2: Illustration of the segmentation results of CLIP and ProxyCLIP. Figures (a) and (e) show the results of CLIP and ProxyCLIP respectively, with an input resolution of 336 $\times$ 336. Figure (f) shows the results of ProxyCLIP with an input resolution of 1024 $\times$ 1024. The upper row of these figures shows the activation map of the bear class, while the lower row shows the segmentation maps. Figures (b) and (d) show the attention weights and cosine similarity map in the last transformer block of CLIP's image encoder and DINO's feature map respectively.
  • Figure 3: Segmentation results using our Splice-then-Segment paradigm. Left: activation map (top) and segmentation results (bottom) for the frog class. Right: cosine similarity map (top) and attention map (bottom) for the given point.
  • Figure 4: Framework of the proposed Trident model. Foundation models are first used to introduce correlations into the features of each sub-image. Subsequently, a correlation matrix derived from the source image and SAM is used to aggregate features across different sub-images. The resulting segmentation maps can then serve as prompts for further refinement by SAM.
  • Figure 5: Qualitative comparison with previous training-free open vocabulary segmentation methods.