ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Huaxi Huang, Ying Tan, Erjin Zhou

TL;DR

ProtoCLIP advances vision-language pretraining by raising the learning signal from instance-level to prototype-level discrimination, using dynamically updated cross-modal prototypes to encourage stable, semantically meaningful grouping. It further decouples representation grouping from alignment via Prototypical Back Translation (PBT), enabling learning across unaligned spaces and allowing an external teacher to inject richer prior knowledge. The approach is trained with an online episodic strategy for scalability and uses soft targets to propagate relational cluster information. Empirically, ProtoCLIP achieves gains on Conceptual Captions (+5.81% ImageNet linear probing and +2.01% ImageNet zero-shot classification) and matches CLIP's performance on YFCC-15M with 33% of the training time, while also improving retrieval and enabling richer cross-modal supervision.

Abstract

Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
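To make the instance-level objective mentioned in the abstract concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of paired image and text features. It is an illustration in plain Python, not the authors' code: the function names (`l2_normalize`, `mean_diagonal_cross_entropy`, `info_nce`) and the temperature value 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i (the positive pair)."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE: image-to-text and text-to-image losses, averaged.

    The temperature is a hypothetical default chosen for this sketch.
    """
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        mean_diagonal_cross_entropy(logits)
        + mean_diagonal_cross_entropy(logits_transposed)
    )


# Toy batch: two image features and two text features in a shared 2-D space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(images, aligned_texts)
loss_swapped = info_nce(images, swapped_texts)
```

In this toy batch the loss is close to zero when each image matches its own text and large when the pairs are swapped, which is the align-positives, separate-negatives behaviour the abstract describes.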

Paper Structure

This paper contains 43 sections, 9 equations, 9 figures, and 12 tables.

Figures (9)

  • Figure 1: Illustrations of representation grouping in 1-dimensional spaces. Each "$\bullet$---$\bullet$" represents an image-image (a) or image-text representation (b)-(d) pair.
  • Figure 2: Left: Image and text prototypes recognized by ProtoCLIP. Each prototype represents a high-level semantic unit. Right: samples assigned to the corresponding prototype, which share similar semantics with it.
  • Figure 3: Model architecture of ProtoCLIP. We set up prototype-level discrimination on top of the instance-level discrimination. We construct prototypes from the representations after the projection heads $g^I$, $g^T$. The prototypes are used to guide the learning of the opposite modality. An external teacher $E$ is introduced for richer supervision, as detailed in Section \ref{sec:learning_unaligned}.
  • Figure 4: Comparison of $\mathcal{L}_{\text{CLIP}}$, $\mathcal{L}_{\text{Proto}}$, and $\mathcal{L}_{\text{Proto}}$ with PBT. Our PBT translates cross-modal prototypes ($C^T$) to within-modal centroids ($C^T_{\text{PBT} \to I}$) according to prototype assignment. Because all of these losses are bi-directional between the image and text spaces, we visualize only the supervision from text (as teacher) to image (as student).
  • Figure 5: Visualization of different data augmentations. ProtoCLIP augmentations maintain higher semantic consistency on non-iconic images in Conceptual Captions.
  • ...and 4 more figures
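The Figure 4 caption describes PBT as translating cross-modal prototypes ($C^T$) into within-modal centroids ($C^T_{\text{PBT} \to I}$) according to prototype assignment. One plausible reading, sketched below, is that each text prototype is replaced by the mean of the image representations whose paired texts were assigned to it. This is an assumption for illustration only: the function name `prototype_back_translation`, the use of a plain mean, and the handling of empty prototypes are hypothetical, and the released code may differ.

```python
def prototype_back_translation(student_feats, teacher_assignments, num_prototypes):
    """Map teacher-space prototypes to student-space centroids.

    student_feats: one feature vector per sample (e.g. image features).
    teacher_assignments: for each sample, the index of the teacher-space
        prototype (e.g. text prototype) that its paired sample was assigned to.
    Returns one centroid per prototype, or None for a prototype with no samples.
    """
    dim = len(student_feats[0])
    sums = [[0.0] * dim for _ in range(num_prototypes)]
    counts = [0] * num_prototypes

    # Accumulate student features under the prototype chosen in the teacher space.
    for feat, proto_idx in zip(student_feats, teacher_assignments):
        counts[proto_idx] += 1
        for d in range(dim):
            sums[proto_idx][d] += feat[d]

    centroids = []
    for proto_idx in range(num_prototypes):
        if counts[proto_idx] == 0:
            centroids.append(None)  # no sample assigned to this prototype
        else:
            centroids.append([s / counts[proto_idx] for s in sums[proto_idx]])
    return centroids


# Toy example: three image features; their paired texts fall into text
# prototypes 0, 0, and 1, while prototype 2 receives no samples.
image_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
text_assignments = [0, 0, 1]
centroids = prototype_back_translation(image_feats, text_assignments, num_prototypes=3)
# centroids == [[2.0, 0.0], [0.0, 2.0], None]
```

Under this reading the student is supervised by centroids that live in its own space, so grouping in the image space does not require the image and text spaces to be aligned, which matches the decoupling of grouping from alignment stated in the abstract.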