FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao

TL;DR

FrozenSeg integrates spatial knowledge from a localization foundation model with semantic knowledge from a vision-language (ViL) model in a synergistic framework, advancing state-of-the-art results across various segmentation benchmarks.

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately leads to inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject the space-aware feature into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy to further improve the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, which is the performance bottleneck. Extensive experiments demonstrate that FrozenSeg, trained exclusively on COCO panoptic data and tested in a zero-shot manner, advances state-of-the-art results across various segmentation benchmarks. Code is available at https://github.com/chenxi52/FrozenSeg.

Paper Structure (33 sections, 4 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Comparison of mask recall on unseen classes and final performance between FC-CLIP and our approach. Evaluated on the Cityscapes and PC-459 datasets with IoU thresholds of 0.5 and 0.75, our FrozenSeg approach significantly increases the mask average recall (AR) of unseen classes and delivers improved final results in Panoptic Quality (PQ) and Mean Intersection-over-Union (mIoU).
  • Figure 2: Overview of our FrozenSeg approach: (Top) We introduce three key components: the Query Injector, Feature Injector and OpenSeg Ensemble Module to enhance open-vocabulary dense-level understanding. Given $N$ queries, spatial information from SAM is injected into these queries within intermediate layers of the transformer encoder, leading to $N$ class and $N$ corresponding mask predictions. The OpenSeg Ensemble Module then integrates these predictions with zero-shot SAM masks to generate the final results. (Bottom) Detailed design of the two injectors.
  • Figure 3: Overview of OpenSeg Ensemble Module. SAM masks are generated through uniform sampling of point prompts. The module employs a novel mask ensemble strategy, injecting SAM mask predictions into unseen mask predictions to enhance the generalization of mask proposals.
  • Figure 4: Qualitative illustration of panoptic segmentation results on Cityscapes. White boxes highlight areas with notable differences between methods. Compared to FC-CLIP, FrozenSeg shows improved performance in predicting small objects (row 1), more accurate entity segmentation (row 2), and better generalization to the unseen class 'rider' (row 3).
  • Figure 5: Qualitative comparison of semantic segmentation results. White boxes indicate areas of discrepancy. Our FrozenSeg (col. 4) produces more contextually appropriate results than FC-CLIP (col. 2) and the ground truth annotations (col. 5).
  • ...and 4 more figures
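
The Figure 1 caption reports mask recall at IoU thresholds of 0.5 and 0.75. As a rough illustration of what such a number measures, the sketch below counts a ground-truth mask as recalled when at least one proposal overlaps it with IoU at or above the threshold. This is a minimal sketch under assumptions: the function names (mask_iou, mask_recall), the toy 2x2 masks, and the "any proposal matches" rule are illustrative and not taken from the paper, whose evaluation protocol may differ.

```python
from typing import Sequence

# Assumption: a binary mask is a list of rows containing 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of the same size."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


def mask_recall(gt_masks: Sequence[Mask], proposals: Sequence[Mask],
                iou_threshold: float) -> float:
    """Fraction of ground-truth masks covered by at least one proposal
    whose IoU with that mask is >= iou_threshold."""
    if not gt_masks:
        return 0.0
    hits = sum(
        1 for gt in gt_masks
        if any(mask_iou(gt, p) >= iou_threshold for p in proposals)
    )
    return hits / len(gt_masks)


# Toy data (hypothetical): two ground-truth masks and two proposals.
gt_masks = [
    [[1, 1], [0, 0]],  # top row
    [[0, 0], [1, 1]],  # bottom row
]
proposals = [
    [[1, 1], [0, 0]],  # exact match for the first mask (IoU 1.0)
    [[0, 0], [1, 0]],  # half of the second mask (IoU 0.5)
]

recall_at_50 = mask_recall(gt_masks, proposals, 0.5)   # 1.0
recall_at_75 = mask_recall(gt_masks, proposals, 0.75)  # 0.5
```

Raising the threshold from 0.5 to 0.75 drops the loosely fitting proposal, which is why recall at the stricter threshold is a more direct signal of mask quality than recall at 0.5.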