Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang
TL;DR
This work addresses CLIP's limited spatial awareness in open-vocabulary dense prediction by introducing Spatial-Correlation-guided Region-Language Alignment (SC-RLA), whose core component is Spatial Correlation Distillation (SCD). A lightweight Refiner further improves spatial fidelity by extracting refined dense features from a frozen CLIP; this enables Refined Spatial Correlation Distillation (R-SCD) and yields the full R-SC-RLA framework. The approach preserves CLIP's visual structure while still leveraging language supervision, giving consistent gains on open-vocabulary object detection and segmentation and improving visual-centric perception in DINOv2. Overall, the results indicate that enforcing spatial correlation and refining dense features are both important for dense prediction with vision-language models, and that these gains come with modest fine-tuning on standard datasets.
Abstract
Contrastive Language-Image Pre-training (CLIP) excels at global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification but underperformance in tasks that require precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from a notable loss of spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates this degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
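
To make the idea of distilling spatial correlations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "spatial correlation" means the pairwise cosine similarity between patch features, turned into a per-patch distribution with a temperature-scaled softmax, and that the student is matched to a frozen teacher with a cross-entropy loss. The abstract does not specify the loss, so the function names, the temperature value, and the choice of cross-entropy are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def correlation_rows(patch_features, temperature):
    """For each patch, softmax over its cosine similarity to every patch.

    patch_features: list of N feature vectors, one per image patch.
    Returns an N x N list of rows, each summing to 1.
    """
    unit = [l2_normalize(f) for f in patch_features]
    rows = []
    for anchor in unit:
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature
                  for other in unit]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def correlation_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Mean cross-entropy between teacher and student correlation rows.

    Hypothetical stand-in for an SCD-style objective: the loss is smallest
    when the student reproduces the teacher's patch-to-patch structure.
    """
    student_rows = correlation_rows(student_feats, temperature)
    teacher_rows = correlation_rows(teacher_feats, temperature)
    loss = 0.0
    for s_row, t_row in zip(student_rows, teacher_rows):
        loss -= sum(t * math.log(s) for t, s in zip(t_row, s_row))
    return loss / len(student_rows)
```

Under these assumptions, a student whose patch features have collapsed to a single vector (no spatial structure) receives a higher loss than one that reproduces the teacher's correlations, which is the kind of degradation the abstract says SCD is meant to prevent. In practice the teacher features would come from a frozen CLIP, or from the Refiner in the R-SCD variant, and the loss would be computed on tensors rather than Python lists.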
