Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, Yazhou Yao

TL;DR

CLIP-PGS tackles the high computational cost of vision-language pretraining by introducing a Patch Generation-to-Selection masking strategy that preserves semantic content. It combines candidate patch preselection, Sobel edge detection, and optimal transport normalization to guide masking and maintain cross-modal alignment. Empirical results show state-of-the-art zero-shot classification and retrieval, along with robustness and language compositionality gains, while reducing pretraining time. The method offers a practical and scalable route to more efficient vision-language models and can be extended to other backbones and masked modeling frameworks.

Abstract

The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and language compositionality benchmarks.
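The three steps described above (candidate pre-selection, a Sobel edge mask, and optimal-transport normalization of patch similarities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of raw pixel vectors as patch features, the median edge threshold, the comparison against all patches rather than spatial neighbors only, and the Sinkhorn-style row/column rescaling standing in for optimal transport normalization are all assumptions made for illustration.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale image using 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(float), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def patch_edge_scores(gray, patch):
    """Mean Sobel magnitude per non-overlapping patch, flattened row-major."""
    mag = sobel_magnitude(gray)
    gh, gw = mag.shape[0] // patch, mag.shape[1] // patch
    mag = mag[:gh * patch, :gw * patch]
    return mag.reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel()

def sinkhorn_normalize(sim, n_iters=50, eps=1e-8):
    """Alternately rescale rows and columns so both sum to roughly 1."""
    p = np.exp(sim - sim.max())
    for _ in range(n_iters):
        p /= p.sum(axis=1, keepdims=True) + eps
        p /= p.sum(axis=0, keepdims=True) + eps
    return p

def select_mask(gray, patch=8, n_candidates=4, mask_ratio=0.5,
                edge_quantile=0.5, seed=0):
    """Return a boolean vector over patches; True means the patch is masked."""
    rng = np.random.default_rng(seed)
    gh, gw = gray.shape[0] // patch, gray.shape[1] // patch
    n = gh * gw
    # Assumption: raw pixel vectors stand in for learned patch features.
    feats = (gray[:gh * patch, :gw * patch].astype(float)
             .reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3).reshape(n, -1))
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    # Step 1: pre-select a small random set of candidate mask patches.
    candidates = rng.choice(n, size=n_candidates, replace=False)
    # Step 2: edge mask; patches with strong edges are protected from masking.
    edge = patch_edge_scores(gray, patch)
    protected = edge >= np.quantile(edge, edge_quantile)
    # Step 3: balanced similarity matrix, then score each patch by its
    # highest normalized similarity to any candidate.
    sim = sinkhorn_normalize(feats @ feats.T)
    score = sim[candidates].max(axis=0)
    score[protected] = -np.inf
    k = min(int(mask_ratio * n), int((~protected).sum()))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(-score)[:k]] = True
    return mask

# Toy usage: a bright square on a dark background, split into a 4x4 patch grid.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
mask = select_mask(image, patch=8)
print(mask.reshape(4, 4).astype(int))
```

In this toy setup the patches covering the square carry the strongest edge response, so they are protected and only low-edge background patches are eligible for masking. A real pipeline would presumably operate on encoder patch embeddings and restrict similarity to neighboring patches, as the abstract describes.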

Paper Structure

This paper contains 18 sections, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Advantages of CLIP-PGS. (a) Visual comparison of masking strategies: random masking (e.g., FLIP), cluster-based masking (e.g., E-CLIP), and our proposed CLIP-PGS. (b) Improvements in zero-shot classification and linear probing tasks, and relative training time reduction achieved by CLIP-PGS.
  • Figure 2: Performance comparison of vision-language pre-training models, such as CLIP, FLIP, A-CLIP, E-CLIP, and CLIP-PGS, evaluated across three dimensions using normalized scores: (a) generalizability, (b) robustness, and (c) compositionality.
  • Figure 3: An illustration of CLIP-PGS. The text input is processed by the text encoder $\mathcal{F}_t$, while the image undergoes our patch generation-to-selection strategy before entering the image encoder $\mathcal{F}_v$. $\mathcal{L}_{cl}$ subsequently aligns the visual and textual embeddings, strengthening cross-modal representation alignment.
  • Figure 4: Visualization of masking regions. We use ViT-B/16 as the image encoder, displaying each sample with the text description, the original image (left), and masking results from CLIP-PGS$_{0.5}$ (middle) at a fixed 0.5 masking ratio, and CLIP-PGS$_{0.3}$ (right) with a variable masking ratio between 0.3 and 0.5. Our models effectively retain the visual content relevant to the accompanying text context.
  • Figure 5: Zero-shot classification on ImageNet-1K. We present plots showing the trend of zero-shot accuracy across training epochs for the models trained on CC12M over 32 epochs.
  • ...and 7 more figures