RankCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang; Zhuokai Zhao; Zhaorun Chen; Zhili Feng; Zenghui Ding; Yining Sun

TL;DR

RankCLIP is a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. It improves the alignment process so that the model captures the nuanced many-to-many relationships between and within modalities.

Abstract

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and by leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
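To make the idea of a list-wise loss with cross-modal and in-modal ranking consistency more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a Plackett-Luce (ListMLE-style) likelihood as the list-wise ranking loss, cosine similarity as the score, and a temperature of 0.07 for the contrastive term; none of these choices are stated in the text above. All function names and the `cross_weight` and `in_modal_weight` parameters are hypothetical, and how they would map onto the paper's $\lambda_1$ and $\lambda_2$ is not specified here.

```python
import math


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b.

    Assumes every row is a non-zero vector.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b_n] for u in a_n]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_nll(scores, reference):
    """Negative log-likelihood, under a Plackett-Luce model with logits
    `scores`, of the ordering given by sorting `reference` descending."""
    order = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)
    nll = 0.0
    for k in range(len(order)):
        remaining = [scores[i] for i in order[k:]]
        nll += log_sum_exp(remaining) - scores[order[k]]
    return nll


def ranking_consistency_loss(score_matrix, reference_matrix):
    """Average list-wise loss over rows: each row of `score_matrix` should
    rank items the same way as the matching row of `reference_matrix`."""
    pairs = zip(score_matrix, reference_matrix)
    return sum(plackett_luce_nll(s, r) for s, r in pairs) / len(score_matrix)


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric pair-wise contrastive loss; matched pairs lie on the diagonal.

    The temperature value is an assumption made for this sketch.
    """
    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [x / temperature for x in row]
            total += log_sum_exp(logits) - logits[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(transposed))


def rank_style_loss(images, texts, cross_weight=0.5, in_modal_weight=0.5):
    """Contrastive term plus two hypothetical ranking-consistency terms."""
    s_vt = cosine_similarity_matrix(images, texts)   # image-to-text scores
    s_tv = [list(col) for col in zip(*s_vt)]         # text-to-image scores
    s_vv = cosine_similarity_matrix(images, images)  # image-image scores
    s_tt = cosine_similarity_matrix(texts, texts)    # text-text scores

    cross = ranking_consistency_loss(s_vt, s_tv)     # cross-modal consistency
    in_modal = ranking_consistency_loss(s_vv, s_tt)  # in-modal consistency
    return clip_contrastive_loss(s_vt) + cross_weight * cross + in_modal_weight * in_modal
```

In this sketch the contrastive term only rewards the diagonal (matched) pairs, while the two ranking terms penalize disagreement between whole rows of similarity scores, which is one way unmatched pairs can contribute graded rather than uniform signal. A real training setup would use a differentiable tensor library and learned encoders rather than Python lists.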

Paper Structure (32 sections, 10 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
RankCLIP
Ranking Model Formulation
Cross-modal Consistency Ranking
In-modal Consistency Ranking
RankCLIP Loss
Training Recipe on Selecting $\lambda_1$ and $\lambda_2$
Experiments
Experimental Setup
Zero-shot Classification
Zero-shot Cross-modal Retrieval
Robustness to Distribution Shifts
Linear Probing
Ablation Studies
...and 17 more sections

Figures (7)

Figure 1: Comparison of learning outcomes between CLIP and RankCLIP using three text-image pairs: dog (red), cat (blue), and car (yellow). (a) Contrastive loss treats all unmatched relationships equally, failing to distinguish latent similar attributes between dog and cat versus airplane. RankCLIP addresses this issue by leveraging the shared attributes in (c) during training, improving the final trained embedding distribution from (b) to (d).
Figure 2: Overview of RankCLIP. Unlike conventional contrastive loss, which includes only the middle term, RankCLIP introduces both cross-modal and in-modal consistency terms by minimizing a self-supervised, list-wise ranking loss. Paired images and texts are indicated by matching contour line colors. $V$, $T$, and $S$ represent image embeddings, text embeddings, and similarity scores, respectively.
Figure 3: Effect of $\lambda_1$ and $\lambda_2$ on zero-shot classification (ImageNet1K) and retrieval (MSCOCO).
Figure 4: Ablation studies of CLIP and RankCLIP trained with different data sizes. Left: zero-shot top-1 classification accuracy on ImageNet1K with various data sizes randomly sampled from CC3M. RankCLIP consistently outperforms CLIP with significant margins. Right: zero-shot top-1 classification accuracy on ImageNet1K (horizontal axis) and ImageNet1K-R (vertical axis). RankCLIP demonstrates better robustness as well as accuracy.
Figure 5: For a given text query, we present the top ten most semantically relevant images (ordered from left to right) obtained through both CLIP and RankCLIP. In comparison to CLIP, our approach consistently retrieves images that more comprehensively align with the textual description, maintaining this advantage even after the correct reference image appears in the ranked results.
...and 2 more figures
