FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos

TL;DR

The paper tackles noise and flawed negative sampling in vision-language contrastive pre-training on web-scale data. It introduces FFF, a multi-component framework that (i) fixes incorrect negatives via on-the-fly mining using cross- and intra-modal similarities and an assignment matrix $M$, (ii) augments captions by generating multiple pseudo-captions per image in a batch, and (iii) trains with a sigmoid loss to accommodate variable numbers of positives and mitigate label noise, with a learnable offset $\beta$. Together, these elements yield large gains across zero-shot image classification and retrieval, achieving state-of-the-art results on standard benchmarks (e.g., average improvements around $+6.2\,\text{pp}$ in classification and $+14$–$+19\,\text{pp}$ in retrieval, plus a new ImageNet top-1 of about $51.1\%$). The approach scales to large open datasets (Open30M/Open70M) and maintains strong performance over baselines, underscoring the importance of data quality and robust multi-positive learning for vision-language pre-training.
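To make the role of the assignment matrix and the learnable offset more concrete, here is a minimal sketch of a pairwise sigmoid loss that accepts several positives per image. It assumes a SigLIP-style formulation; the paper's exact loss is not reproduced in this summary. The function name, the `temperature` parameter, and the toy similarity values are illustrative assumptions, and the offset (named `beta` here) is a plain float rather than a trained parameter.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multi_positive_sigmoid_loss(sim, M, temperature=0.1, beta=-5.0):
    """Pairwise sigmoid loss over an image-text similarity matrix.

    sim[i][j]: cosine similarity between image i and caption j.
    M[i][j]:   1 if the pair is treated as a positive, 0 if a negative.
    temperature, beta: hypothetical stand-ins for learnable scalars.
    """
    total = 0.0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            label = 1.0 if M[i][j] else -1.0
            logit = s / temperature + beta
            # -log(sigmoid(label * logit)) == softplus(-label * logit)
            total += softplus(-label * logit)
    return total / len(sim)


# Toy batch: caption 1 also describes image 0 (a false negative).
sim = [
    [0.9, 0.8, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
M_naive = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # only the paired caption is positive
M_fixed = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]  # mined positive added for image 0

loss_naive = multi_positive_sigmoid_loss(sim, M_naive)
loss_fixed = multi_positive_sigmoid_loss(sim, M_fixed)
```

Because every image-text pair contributes an independent binary term, a row of the assignment matrix may contain any number of positives. In this toy batch, marking the well-aligned pair as a positive removes the penalty it incurred as a negative, so `loss_fixed` is lower than `loss_naive`.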

Abstract

Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper, we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).

Paper Structure (18 sections, 2 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Our approach, FFF, achieves state-of-the-art accuracy across multiple datasets, largely outperforming prior methods.
  • Figure 2: Semantic and lexical diversity of raw and synthetic pseudo-captions of CC3M: (a) Average cosine similarities of each caption and its 100 most similar captions using CLIP ViT-L/14 features. (b) Cosine similarities between features of each image and its ground-truth caption. (c) The frequencies of the top-100 most frequent raw and synthetic pseudo-captions (generated using BLIP2). We observe that the raw captions are semantically similar to each other (a), often not well aligned with their associated ground-truth images (b), and include a high number of basic and redundant captions (c). By swapping them with pseudo-captions, we observe an improved diversity (a,c) and better image-text alignment (b).
  • Figure 3: Quality assessment of synthetic captions of CC3M: (a) Average intra-cosine similarities between 5 synthetic captions of each image. (b) Cosine similarities between the features of each image and either the features of a single synthetic pseudo-caption or the averaged features of 5 pseudo-captions. We observe that, by using multiple diverse synthetic positives (a), possibly erroneous captions can be corrected by an ensemble of pseudo-captions that better converges to the ground truth, resulting in text features that are better aligned with their associated images (b). (c) The rankings of the ground-truth captions for each image in a batch of 1k image-caption pairs. This shows that, even with relatively small batches, many negatives are well aligned with some images, and it is very likely that many of these negatives are potentially correct matches for a subset of images, i.e. false negatives. Features are computed using CLIP ViT-L/14.
  • Figure 4: Qualitative samples of synthetic captions from CC3M: We show 4 examples featuring original raw and synthetic (BLIP2) pseudo-captions. These examples highlight typical limitations and challenges observed in synthetic captions which, while superior to raw captions, can still be considered noisy.
  • Figure 5: Examples of high-ranking captions from CC3M: We show 3 examples of raw and synthetic captions ranked higher than the ground-truths from a batch of 1k image-caption pairs. In green, we show potential false negatives that can be used as new positives for improved training. However, possible false positives, as shown in red, can still occur. These can be handled by the robust sigmoid loss. Rankings are obtained using CLIP ViT-L/14.
  • ...and 1 more figure
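Two of the measurements described in the Figure 3 caption can be sketched in a few lines: the rank of each image's ground-truth caption within a batch (panel c), and the alignment of an image with the averaged features of several pseudo-captions (panel b). This is a hedged illustration only. The helper names and the 2-D toy vectors are assumptions and stand in for the CLIP ViT-L/14 features used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def ground_truth_ranks(sim):
    """Rank (1 = best) of the paired caption i for each image i.

    sim[i][j] is the similarity between image i and caption j; the
    ground-truth caption of image i is assumed to sit at index i.
    """
    return [1 + sum(1 for s in row if s > row[i]) for i, row in enumerate(sim)]


# Panel (c): a rank above 1 means some "negative" caption scores higher
# than the ground truth, i.e. it is a candidate false negative.
sim = [
    [0.30, 0.35, 0.10],
    [0.05, 0.40, 0.20],
    [0.32, 0.10, 0.28],
]
ranks = ground_truth_ranks(sim)  # [2, 1, 2]

# Panel (b): two toy captions that err in opposite directions.
image = [1.0, 0.0]
captions = [[0.8, 0.6], [0.8, -0.6]]
single_scores = [cosine(image, c) for c in captions]
ensemble_score = cosine(image, mean_vector(captions))
```

In the toy example, each individual caption is only partly aligned with the image, while their average points in the same direction as the image feature. This mirrors the ensemble effect that the caption describes, under the assumption that caption errors are diverse enough to partially cancel.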