CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Xiangtai Li; Wentao Liu; Chen Change Loy

TL;DR

This work tackles open-vocabulary dense prediction with ViT-based CLIP models, whose region-language alignment lags behind their image-level alignment. It introduces CLIPSelf, a self-distillation method that aligns region representations pooled from a ViT's dense feature map with the image-level representations of the corresponding image crops, without relying on region-text pairs. The approach yields state-of-the-art results on open-vocabulary object detection (OV-COCO, OV-LVIS) and on open-vocabulary segmentation when combined with Cat-Seg and ODISE, and it improves dense representations across ViT variants, including extensions that use region proposals and CC3M data. By bridging global and local representations in a simple, data-efficient way, CLIPSelf broadens the applicability of CLIP ViTs to high-resolution, open-vocabulary dense prediction tasks.

Abstract

Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we conduct an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
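
The alignment step described in the abstract can be illustrated with a small, dependency-free sketch. It assumes the objective is a mean of (1 - cosine similarity) between each student region embedding and the teacher embedding of the matching crop; the function names `cosine_similarity` and `self_distillation_loss` are hypothetical and are not taken from the released code. In practice the embeddings would come from the CLIP ViT rather than from hand-written lists.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def self_distillation_loss(
    student_regions: Sequence[Sequence[float]],
    teacher_crops: Sequence[Sequence[float]],
) -> float:
    """Mean of (1 - cosine similarity) over matched region/crop pairs.

    student_regions[i]: embedding pooled from the student's dense feature map
                        for region i (assumed input).
    teacher_crops[i]:   image-level embedding the frozen teacher produces for
                        the image crop of region i (assumed input).
    """
    if len(student_regions) != len(teacher_crops) or not student_regions:
        raise ValueError("need the same non-zero number of regions and crops")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_regions, teacher_crops)
    )
    return total / len(student_regions)
```

Under this assumed objective the loss is 0 when every region embedding points in the same direction as its crop embedding and grows to 2 when they point in opposite directions, so minimizing it pulls the dense representation toward the image-level one.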

Paper Structure (21 sections, 3 equations, 7 figures, 21 tables)

Introduction
Related Work
Methodology
Image Representation vs. Dense Representation
CLIPSelf
Application to Open-Vocabulary Dense Prediction
Experiments
Ablation Study of CLIPSelf
Enhancement of Dense Representation by CLIPSelf
Application to Open-Vocabulary Tasks
Discussion
Conclusion
Acknowledgements
Appendix
CLIP Models' Dense Representation
...and 6 more sections

Figures (7)

Figure 1: (a) CLIP ViTs exhibit excellent zero-shot ability on image classification compared with CLIP CNNs. (b) To classify regions, a CLIP ViT is as effective as a CLIP CNN by separately classifying the image crop of each region. However, it struggles when extracting region representation from the dense feature map for recognition. (c) The K-Means results of the CLIP ViT's dense feature are much noisier, demonstrating the inferiority of CLIP ViT's dense representation.
Figure 2: (a) Using region-text pairs to fine-tune CLIP for dense prediction tasks. These pairs are either manually annotated or generated via matching between region proposals and parsed image captions. (b) Our CLIPSelf does not rely on the association between text descriptions and regions, and only uses CLIP ViT's representations of image patches to learn the dense features.
Figure 3: (a) Region classification using image representation (blue) and dense representation (green) of CLIP ViTs. The y-axis stands for the mean accuracy (mAcc). The x-axis is the input image size for obtaining dense feature maps (green). The input size for image representation (blue) of the image crops is fixed at $224 \times 224$ for ViT-B/16 and $336 \times 336$ for ViT-L/14. (b) CLIPSelf randomly splits an image into patch regions for self-distillation. Then it aligns the region representation pooled (by RoIAlign) from the dense feature map of the student to the corresponding image representation of the teacher. Teacher: the original CLIP ViT; Student: the fine-tuned CLIP ViT.
Figure 4: K-Means visualization of the dense feature maps of CLIP ViT. We show the raw images, the K-Means results of the original model, and those of our fine-tuned model by CLIPSelf.
Figure A1: Region classification using CLIP Models. The x-axis of the figures stands for the input size to obtain dense features. The input size for the image-level representation of the image crops is fixed at $288 \times 288$ for RN50$\times$4, $448 \times 448$ for RN50$\times$64, $224 \times 224$ for ViT-B/16 and $336 \times 336$ for ViT-L/14.
...and 2 more figures
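
The random splitting mentioned in the caption of Figure 3(b) can be sketched as follows. This is only an illustration: it assumes the image is divided into a uniform grid whose row and column counts are sampled independently, and the helper name `random_grid_boxes` and the default upper bounds of 6 are assumptions rather than details stated on this page. Each returned box would then serve both as the crop fed to the teacher and as the region pooled from the student's dense feature map.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def random_grid_boxes(
    width: int,
    height: int,
    max_rows: int = 6,   # assumed upper bound, not taken from the source
    max_cols: int = 6,   # assumed upper bound, not taken from the source
    rng: Optional[random.Random] = None,
) -> List[Box]:
    """Split an image of the given size into a random rows x cols grid."""
    if width <= 0 or height <= 0 or max_rows < 1 or max_cols < 1:
        raise ValueError("image size and grid bounds must be positive")
    if rng is None:
        rng = random.Random()
    rows = rng.randint(1, max_rows)  # inclusive on both ends
    cols = rng.randint(1, max_cols)
    cell_w = width / cols
    cell_h = height / rows
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            boxes.append(
                (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            )
    return boxes
```

Passing a seeded `random.Random` instance makes the split reproducible, which helps when comparing student and teacher outputs on the same regions.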
