CAE v2: Context Autoencoder with CLIP Target

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

TL;DR

This work interrogates how CLIP-based supervision should be applied in masked image modeling by introducing CAE v2, a simple CLIP-targeted pipeline. It finds that supervising only visible patches in CLIP space can match or exceed traditional masked-patch supervision and that the optimal mask ratio increases with model size. Through 300-epoch pretraining on multiple ViT backbones, CAE v2 achieves leading performance on ImageNet-1K, ADE20K, and COCO compared to prior CLIP-targeted MIM methods, offering practical guidelines for MIM pretraining, especially for small models. The study provides actionable insights into supervision placement and masking strategies that can guide future MIM research and pretraining regimes.

Abstract

Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has proven effective for MIM. However, how CLIP supervision in MIM influences performance is still under-explored. To investigate strategies for refining CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting findings, relying on our simple pipeline, context autoencoder with CLIP target (CAE v2). Firstly, we observe that supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, which is the standard format in existing MIM methods. Secondly, the optimal mask ratio correlates positively with the model size. That is to say, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K, with 300 epochs of pre-training. We hope our findings can serve as helpful guidelines for pre-training in the MIM area, especially for small-scale models.

Paper Structure (11 sections, 1 equation, 4 figures, 6 tables)

This paper contains 11 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the proposed CAE v2. CAE v2 first masks the input image $\pmb{\mathrm{x}}$ with the mask ratio $\gamma$, which is positively correlated with the model size of the encoder. $\propto$ represents the positive correlation. Then, CAE v2 feeds the visible patches $\pmb{\mathrm{X}}_v$ into the encoder to obtain the latent representation $\pmb{\mathrm{Z}}_v$. The decoder receives $\pmb{\mathrm{Z}}_v$ and the mask token $\pmb{\mathrm{E}}_m$ to recover the latent representations of the masked patches $\pmb{\mathrm{Z}}_m$. A lightweight head then projects $\pmb{\mathrm{Z}}_v$ and $\pmb{\mathrm{Z}}_m$ to $\pmb{\mathrm{Y}}_v$ and $\pmb{\mathrm{Y}}_m$. CAE v2 also feeds $\pmb{\mathrm{x}}$ into the CLIP model to generate the target supervisions, which are split into $\pmb{\mathrm{T}}_v$ and $\pmb{\mathrm{T}}_m$ according to the absolute positions of $\pmb{\mathrm{X}}_v$ and $\pmb{\mathrm{X}}_m$. The loss is applied to the prediction $\pmb{\mathrm{Y}}_v$ and the target supervision $\pmb{\mathrm{T}}_v$ of the visible patches. The loss on $\pmb{\mathrm{Y}}_m$ and $\pmb{\mathrm{T}}_m$ for the masked patches is optional.
  • Figure 2: MVP (wei2022mvp) vs. our CAE v2. We mainly study the supervision position and the mask ratio in the CLIP-targeted MIM.
  • Figure 3: Influence of the mask ratio in our CAE v2 on different model sizes, including (top row) ViT-Tiny, (middle row) ViT-Small and (bottom row) ViT-Base. The optimal mask ratio is positively correlated with the model size: a larger model benefits from a higher mask ratio, while a smaller model prefers a lower one. The y-axes show the top-1 accuracy (%) on (left column) linear probing and (middle column) fine-tuning on ImageNet-1K, and (right column) the mIoU (%) on ADE20K.
  • Figure 4: Illustration of corrupted images with different mask ratios $\gamma$ via (top row) block-wise sampling strategy (our default) and (bottom row) random sampling strategy.
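
The supervision placement described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`random_mask`, `split_by_position`, `clip_target_loss`), the `masked_weight` parameter, and the use of cosine distance as the per-patch loss are all hypothetical choices made for the example. It also uses the random sampling strategy from Figure 4 rather than the default block-wise strategy, purely to keep the sketch short.

```python
import math
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Pick masked patch positions uniformly at random (not block-wise).

    Returns (visible, masked) as sorted lists of absolute patch indices.
    """
    num_masked = int(round(num_patches * mask_ratio))
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


def split_by_position(targets, visible, masked):
    """Split per-patch targets into T_v and T_m by absolute position."""
    return [targets[i] for i in visible], [targets[i] for i in masked]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed loss; the text above does not name one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_distance(preds, targets):
    """Average cosine distance over paired prediction/target vectors."""
    return sum(cosine_distance(p, t) for p, t in zip(preds, targets)) / len(preds)


def clip_target_loss(y_v, t_v, y_m=None, t_m=None, masked_weight=1.0):
    """Loss on visible patches, plus an optional term on masked patches."""
    loss = mean_distance(y_v, t_v)
    if y_m and t_m:
        loss += masked_weight * mean_distance(y_m, t_m)
    return loss
```

With a toy input of 16 patches and a mask ratio of 0.25, `random_mask(16, 0.25)` yields 12 visible and 4 masked positions; calling `clip_target_loss(y_v, t_v)` alone corresponds to supervising only the visible patches, while passing `y_m` and `t_m` as well adds the optional masked-patch term.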