MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Xiaoyi Dong; Jianmin Bao; Yinglin Zheng; Ting Zhang; Dongdong Chen; Hao Yang; Ming Zeng; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu

TL;DR

MaskCLIP addresses a limitation of global vision-language (VL) contrastive learning by adding masked self-distillation, which learns local patch representations and aligns them with language supervision. It also adds local semantic supervision to the text branch and uses a small decoder to prevent conflicts, combining the VL contrastive loss with masked image modeling and masked language modeling losses. Across ImageNet-1K, ADE20K, MS-COCO, and Flickr30K, MaskCLIP achieves substantial gains in zero-shot classification, linear probing, finetuning, and zero-shot VL tasks, demonstrating improved visual encoder transferability and cross-modal alignment. The approach trains in a single stage with a ViT backbone and an EMA teacher, and the authors state that code will be released.

Abstract

This paper presents MaskCLIP, a simple yet effective framework that incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation has two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to the vision-language contrastive objective that focuses on text-related representation. Second, masked self-distillation is consistent with the vision-language contrastive objective from the perspective of training, as both use the visual encoder for feature alignment, and it is therefore able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. Code will be released at https://github.com/LightDXY/MaskCLIP.
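
To make the core idea concrete, the following is a minimal sketch of a masked self-distillation objective, not the authors' implementation. It assumes per-patch features are plain lists of floats, that the teacher sees the full image while the student predicts features for the masked patches, and that the distance is one minus cosine similarity. The distance choice and the function names `cosine` and `masked_distill_loss` are illustrative assumptions; the paper's actual loss may differ.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def masked_distill_loss(teacher_feats, student_preds, mask):
    """Mean (1 - cosine) over masked patches only.

    teacher_feats: per-patch features from the teacher on the FULL image.
    student_preds: per-patch features predicted by the student from the
                   MASKED image.
    mask:          per-patch booleans, True where the patch was masked.
    The (1 - cosine) distance is an assumption made for illustration.
    """
    losses = [
        1.0 - cosine(t, s)
        for t, s, m in zip(teacher_feats, student_preds, mask)
        if m
    ]
    if not losses:
        raise ValueError("mask must select at least one patch")
    return sum(losses) / len(losses)


# Hypothetical toy example: 3 patches with 2-D features, last two masked.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
mask = [False, True, True]
loss = masked_distill_loss(teacher, student, mask)
```

Only masked positions contribute to the loss, so the student is pushed to recover the teacher's local patch features from the surrounding visible context.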

Paper Structure (17 sections, 8 equations, 5 figures, 9 tables)

Introduction
Related Work
MaskCLIP
Vision-language Contrastive
Masked Self-distillation for Visual Encoder
Local Semantic Learning for Text Encoder
Overall Loss Functions
Experiments
Setup
Analysis
Comparison with Previous Methods
Ablations
Conclusion
More Experiment
Experiment detail
...and 2 more sections

Figures (5)

Figure 1: Pipeline comparison of combining CLIP with different vision self-supervised learning methods. (a) Vanilla CLIP. (b) CLIP + contrastive learning. (c) CLIP + pixel-prediction masked image modeling. (d) CLIP + masked self-distillation, i.e. MaskCLIP. $E_T$ and $E_I$ are the text encoder and image encoder, respectively, and all $E_I$, $E_T$ within each pipeline share weights. $\bar{E}_I$ is the mean-teacher model, whose weights are updated as the exponential moving average of $E_I$ and do not require gradients.
Figure 2: Visualization of the similarity between text and image features. The images and captions are from the MS-COCO val set. Here we show the image feature similarity with both the full caption and the different objects in it. The caption is "Three teddy bears sit in a sled in snow". More results can be found in the supplemental materials.
Figure 3: Visualization of the similarity between text and image features. The images and captions are from the MS-COCO val set.
Figure 4: Visualization of the similarity between words and image features. The images and captions are from the MS-COCO val set.
Figure 5: Visualization of the similarity between words and image features. The images and captions are from the MS-COCO val set.
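
The mean-teacher update described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: weights are represented as flat lists of floats, and the function name `ema_update` and the momentum value `0.999` are hypothetical choices, since the momentum actually used in the paper is not stated here.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an exponential moving average.

    teacher:  current teacher weights (flat list of floats).
    student:  current student weights, same length as teacher.
    momentum: fraction of the old teacher kept at each step (assumed value).
    No gradient flows through the teacher; it only tracks the student.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same length")
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher, student)
    ]


# Hypothetical toy example: the teacher drifts slowly toward the student.
teacher_w = [1.0, 0.0]
student_w = [0.0, 1.0]
teacher_w = ema_update(teacher_w, student_w, momentum=0.9)
```

A momentum close to 1 makes the teacher change slowly, which gives the student a stable distillation target between updates.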
