FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

Adrian Bulat; Yassine Ouali; Georgios Tzimiropoulos

TL;DR

The paper tackles noise and flawed negative sampling in vision-language contrastive pre-training on web-scale data. It introduces FFF, a multi-component framework that (i) fixes incorrect negatives via on-the-fly mining using cross- and intra-modal similarities and an assignment matrix $M$, (ii) augments captions by generating multiple pseudo-captions per image in a batch, and (iii) trains with a sigmoid loss to accommodate variable numbers of positives and mitigate label noise, with a learnable offset $\beta$. Together, these elements yield large gains across zero-shot image classification and retrieval, achieving state-of-the-art results on standard benchmarks (e.g., average improvements around $+6.2\,\text{pp}$ in classification and $+14$–$+19\,\text{pp}$ in retrieval, plus a new ImageNet top-1 of about $51.1\%$). The approach scales to large open datasets (Open30M/Open70M) and maintains strong performance over baselines, underscoring the importance of data quality and robust multi-positive learning for vision-language pre-training.

Abstract

Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper, we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
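
To make the multi-positive requirement concrete, the following is a minimal sketch of a pairwise sigmoid loss in which each image may have several positive captions. It is not the authors' implementation: it assumes the common pairwise sigmoid formulation, a binary matrix `M` marking positives (playing the role of the assignment matrix mentioned in the TL;DR), and normalization by the number of images. The function names, the `scale` and `beta` default values, and the toy similarities are illustrative assumptions.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for any real x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def multi_positive_sigmoid_loss(sim, M, scale=10.0, beta=-5.0):
    """Pairwise sigmoid loss with an arbitrary number of positives per image.

    sim[i][j]: cosine similarity between image i and caption j.
    M[i][j]:   1 if caption j is treated as a positive for image i, else 0.
    scale, beta: temperature-like scale and offset (assumed values; in
                 practice these would be learnable parameters).
    """
    n_images = len(sim)
    total = 0.0
    for i in range(n_images):
        for j in range(len(sim[i])):
            z = 1.0 if M[i][j] else -1.0  # +1 for positives, -1 for negatives
            total -= log_sigmoid(z * (scale * sim[i][j] + beta))
    return total / n_images  # assumed normalization: average over images


# Toy batch: 2 images, 3 captions. Image 0 has two valid captions (0 and 1).
sim = [[0.9, 0.8, 0.1],
       [0.0, 0.1, 0.9]]
M_multi = [[1, 1, 0],
           [0, 0, 1]]
M_single = [[1, 0, 0],   # caption 1 wrongly treated as a negative of image 0
            [0, 0, 1]]

loss_multi = multi_positive_sigmoid_loss(sim, M_multi)
loss_single = multi_positive_sigmoid_loss(sim, M_single)
```

Because every image-caption pair contributes an independent binary term, adding a second positive for an image only flips the sign of one term. In this toy batch, marking the well-aligned caption 1 as a negative of image 0 yields a noticeably larger loss than marking it as a positive.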

Paper Structure (18 sections, 2 equations, 6 figures, 14 tables)

Introduction
Flaws of web-collected datasets & potential solutions
Related work
Method
Fixing incorrect negatives
Batch text augmentation with multiple positives
Combined approach
Loss function
Results
Comparison with state-of-the-art
Ablation studies
Conclusions
Additional comparisons with state-of-the-art
Zero-shot recognition on Open30M and Open70M datasets
Linear probe
...and 3 more sections

Figures (6)

Figure 1: Our approach, FFF, achieves state-of-the-art accuracy across multiple datasets, largely outperforming prior methods.
Figure 2: Semantic and lexical diversity of raw and synthetic pseudo-captions of CC3M: (a): Average cosine similarities of each caption and its 100 most similar captions using CLIP ViT-L/14 features. (b): Cosine similarities between features of each image and its ground-truth caption. (c): The frequencies of the top-100 most frequent raw and synthetic pseudo-captions (generated using BLIP2). We observe that the raw captions are semantically similar to each other (a), often not well aligned with their associated ground-truth images (b), and contain a high number of basic and redundant captions (c). By swapping them with pseudo-captions, we observe an improved diversity (a,c) and better image-text alignment (b).
Figure 3: Quality assessment of synthetic captions of CC3M: (a) Average intra-cosine similarities between 5 synthetic captions of each image. (b) Cosine similarities between the features of each image and either the features of a single synthetic pseudo-caption or the averaged features of 5 pseudo-captions. From (a) and (b), we observe that, when multiple diverse synthetic positives are used (a), possibly erroneous captions can be corrected by an ensemble of pseudo-captions that better converges to the ground truth, resulting in text features more aligned with their associated images (b). (c): The rankings of the ground-truth captions for each image in a batch of 1$k$ image-caption pairs. This shows that, even with relatively small batches, many negatives are well aligned with some images, and it is very likely that many of these negatives are in fact correct matches for a subset of images, i.e. false negatives. Features are computed using CLIP ViT-L/14.
Figure 4: Qualitative samples of synthetic captions from CC3M: We show 4 examples featuring original raw and synthetic (BLIP2) pseudo-captions. These examples highlight typical limitations and challenges observed in synthetic captions which, while superior to raw captions, can still be considered noisy.
Figure 5: Examples of high-ranking captions from CC3M: We show 3 examples of raw and synthetic captions ranked higher than the ground-truths from a batch of 1$k$ image-caption pairs. In green, we show potential false negatives that can be used as new positives for improved training. However, possible false positives, as shown in red, can still occur. These can be handled by the robust sigmoid loss. Rankings are obtained using CLIP ViT-L/14.
...and 1 more figures
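
The ranking analysis described in the captions of Figures 3(c) and 5 can be illustrated with a small sketch. It assumes a precomputed image-to-caption similarity matrix in which caption `i` is the ground truth of image `i`; the function names and the toy values are hypothetical, and the paper's actual mining procedure (which also uses intra-modal similarities) is not reproduced here.

```python
def ground_truth_ranks(sim):
    """Return the 1-based rank of each image's ground-truth caption.

    sim[i][j]: similarity between image i and caption j, where caption i
    is assumed to be the ground-truth caption of image i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count the other captions that score strictly higher than the ground truth.
        higher = sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(higher + 1)
    return ranks


def outranking_captions(sim):
    """List (image, caption) pairs where a non-ground-truth caption outranks
    the ground truth; these are candidate false negatives, not confirmed ones."""
    return [(i, j)
            for i, row in enumerate(sim)
            for j, s in enumerate(row)
            if j != i and s > row[i]]


# Toy 3x3 similarity matrix (illustrative values only).
sim = [[0.30, 0.50, 0.10],
       [0.20, 0.60, 0.10],
       [0.40, 0.10, 0.35]]

ranks = ground_truth_ranks(sim)            # [2, 1, 2]
candidates = outranking_captions(sim)      # [(0, 1), (2, 0)]
```

A rank greater than 1 means at least one nominal negative is better aligned with the image than its own caption. As the Figure 5 caption notes, such candidates can be true matches or false positives, which is why a loss that tolerates label noise is needed.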
