ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

TL;DR

ProxyCLIP tackles open-vocabulary segmentation by combining CLIP's semantic alignment with the spatial coherence of Vision Foundation Models (VFMs) in a training-free framework. It introduces a Proxy Attention Module that uses VFM-derived feature correspondence as proxy attention to reweight CLIP values, aided by adaptive normalization and masking, and benefits from smaller VFM patch sizes for sharper boundaries. Across eight benchmarks, ProxyCLIP raises the average mIoU from 40.3 to 44.4 and outperforms both training-free and weakly supervised baselines, while remaining compatible with larger CLIP backbones. The work demonstrates a practical, robust way to fuse semantic richness with spatial precision, with potential extensions to diffusion-based VFMs and other open-vocabulary tasks.
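
The proxy attention idea can be illustrated with a minimal sketch in plain Python. It assumes one feature vector per patch from the VFM and one value vector per patch from CLIP. The exact form of the normalization (shift by beta times the mean similarity, then scale by gamma) and of the masking (drop negative scores before the softmax) is an assumption made for illustration, as are the function names, the default values of beta and gamma, and the fallback for fully masked rows; this summary does not specify them.

```python
import math


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between patch feature vectors."""
    unit = []
    for f in feats:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0  # guard against zero vectors
        unit.append([x / norm for x in f])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def proxy_attention(vfm_feats, clip_values, beta=1.2, gamma=3.0):
    """Reweight CLIP value vectors using attention derived from VFM features.

    ASSUMPTION: scores are gamma * (similarity - beta * mean similarity), and
    negative scores are masked out before the softmax. beta and gamma are
    hypothetical defaults chosen for this sketch.
    """
    sim = cosine_similarity_matrix(vfm_feats)
    n = len(sim)
    mean_sim = sum(sum(row) for row in sim) / (n * n)
    dim = len(clip_values[0])

    outputs = []
    for i, row in enumerate(sim):
        scores = [gamma * (s - beta * mean_sim) for s in row]
        # Masking: only patches with a non-negative score may be attended to.
        kept = [(j, s) for j, s in enumerate(scores) if s >= 0.0]
        if not kept:
            kept = [(i, 0.0)]  # ASSUMPTION: a fully masked patch attends to itself
        peak = max(s for _, s in kept)  # subtract the max for a stable softmax
        exps = [(j, math.exp(s - peak)) for j, s in kept]
        total = sum(e for _, e in exps)
        out = [0.0] * dim
        for j, e in exps:
            weight = e / total
            for d in range(dim):
                out[d] += weight * clip_values[j][d]
        outputs.append(out)
    return outputs


# Toy input: patches 0 and 1 look alike to the VFM, as do patches 2 and 3.
vfm_feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
clip_values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled = proxy_attention(vfm_feats, clip_values)
```

In this toy input, each patch pools CLIP values only from the patch the VFM considers similar, so the two groups do not mix. A practical version would run batched on a tensor library and project the pooled values with CLIP's own output layers; those steps are omitted here.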

Abstract

Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
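
For readers unfamiliar with the metric, the following sketch shows how mean Intersection over Union can be computed for a single prediction. The function name, the flat-list input format, the ignore label of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration. Benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes for flat sequences of per-pixel class ids.

    ASSUMPTION: pixels labeled ignore_index in the ground truth are skipped,
    and classes that appear in neither pred nor target are left out of the mean.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once toward both intersection and union.
            inter[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the true class and,
            # if valid, of the predicted class.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

The average reported across the eight benchmarks is then the arithmetic mean of the per-benchmark mIoU values.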

Paper Structure (35 sections, 6 equations, 11 figures, 8 tables)


Figures (11)

  • Figure 1: Precision recall curves of different classifiers. Higher average precision (AP) indicates better semantic correspondence.
  • Figure 2: Attention scores (maps) between CLIP, DINO and SAM using different seeds (in red). For CLIP's attention maps, we display only the first head of multi-head self-attention maps.
  • Figure 3: Overview of the ProxyCLIP architecture. ProxyCLIP consists of two frozen image encoders and a novel proxy attention module (PAM). On the right, the flow of the proxy attention mechanism with an adaptive normalization and masking strategy is illustrated, corresponding to the normalization, masking, and proxy attention equations.
  • Figure 4: The statistics of similarity matrix before (left) and after (right) normalization.
  • Figure 5: Qualitative comparison of semantic segmentation results.
  • ...and 6 more figures