Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

TL;DR

The work tackles the spatial-awareness limitations of CLIP for open-vocabulary (OV) dense prediction by introducing Spatial-Correlation-guided Region-Language Alignment (SC-RLA) and its core component, Spatial Correlation Distillation (SCD). A lightweight Refiner is proposed to further enhance spatial fidelity by extracting refined dense features from a frozen CLIP, enabling Refined Spatial Correlation Distillation (R-SCD) and resulting in the full R-SC-RLA framework. The approach preserves CLIP's visual structure while leveraging language supervision, yielding consistent gains on OV object detection and segmentation, and improving visual-centric perception in DINOv2. Overall, the method demonstrates that enforcing spatial correlation and refining dense features are crucial for high-performance dense prediction with vision-language models, offering practical improvements with modest fine-tuning on standard datasets.

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
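
As a rough illustration of what distilling spatial correlations could look like, the sketch below compares the token-to-token similarity structure of a student encoder with that of a frozen teacher. The abstract does not give the loss, so everything specific here is an assumption made for illustration rather than the authors' implementation: the cosine-similarity correlation, the softmax temperature, the cross-entropy objective, and the names `spatial_correlation` and `scd_loss`.

```python
import numpy as np


def spatial_correlation(tokens, temperature=0.1):
    """Row-wise softmax over pairwise cosine similarities of dense tokens.

    tokens: array of shape (num_tokens, dim).
    Returns an array of shape (num_tokens, num_tokens) whose rows sum to 1.
    The temperature value is an illustrative assumption.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    logits = normed @ normed.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def scd_loss(student_tokens, teacher_tokens, temperature=0.1):
    """Mean cross-entropy between teacher and student correlation rows.

    Hypothetical objective: the student is penalized when its
    token-to-token correlation pattern drifts from the teacher's.
    """
    p_teacher = spatial_correlation(teacher_tokens, temperature)
    p_student = spatial_correlation(student_tokens, temperature)
    per_token = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=1)
    return float(per_token.mean())
```

In a fine-tuning loop, a term like this would be added to the region-language alignment loss, with the teacher tokens coming from the frozen CLIP (or, in the refined variant, from the Refiner) and the student tokens from the model being fine-tuned.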

Paper Structure

This paper contains 31 sections, 15 equations, 20 figures, 12 tables.

Figures (20)

  • Figure 1: (a) Evaluation of dense feature quality. We visualize the object-level dense features of the image encoder with t-SNE and present the unsupervised segmentation results. Existing Region-Language Alignment methods lead to significant degradation of visual-centric feature quality. (b) The framework of our fine-tuning structure. We design an additional visual-centric branch for RLA to enhance the model's spatial awareness.
  • Figure 2: Overview of SC-RLA. The conventional RLA process (blue arrow) aligns the region representations of the student model with the corresponding language supervision signals generated by either CLIP's text encoder or image encoder. We enhance this process by integrating Spatial Correlation Distillation (red arrow) to preserve the structural relationships between visual tokens.
  • Figure 3: A training-free illustration of refining CLIP. We compute the average features from a frozen CLIP model across diverse contexts to mitigate semantic contamination. As the number of aggregated images $N$ increases, the model's spatial awareness improves progressively.
  • Figure 4: CLIP refining pipeline. The proposed pipeline enhances CLIP's dense representations using a lightweight Refiner module. Initialized with the last $K$ layers of CLIP's image encoder, this module aggregates corresponding tokens in a global-to-local dynamic, eliminating unnecessary contextual distortion and focusing on high-quality local semantics.
  • Figure 5: Visual-centric analysis. (a) We visualize the affinity map w.r.t. a selected query token embedding (marked by the red dot) of the visual encoder. (b) Unsupervised segmentation evaluation with CAUSE on Cityscapes, where the mIoU is reported.
  • ...and 15 more figures
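
The affinity map described in the Figure 5 caption can be approximated with a few lines of code. This is a minimal sketch, assuming the dense features are available as an (H, W, D) grid of token embeddings and that affinity is measured by cosine similarity; the function name `affinity_map` and the similarity measure are illustrative assumptions, since the caption does not specify them.

```python
import numpy as np


def affinity_map(dense_features, query_yx):
    """Cosine similarity between one query token and every token in the grid.

    dense_features: array of shape (H, W, D), one embedding per patch.
    query_yx: (row, col) position of the query token.
    Returns an array of shape (H, W) with values in [-1, 1].
    """
    h, w, d = dense_features.shape
    flat = dense_features.reshape(h * w, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    query = flat[query_yx[0] * w + query_yx[1]]
    return (flat @ query).reshape(h, w)
```

Plotting the returned grid as a heat map over the input image gives a view comparable to the one the caption describes: an encoder with good spatial awareness should show high affinity mostly on patches belonging to the same object as the query.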