A Sober Look at the Robustness of CLIPs to Spurious Features

Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang

TL;DR

Theoretical insights show that the CLIP objective cannot offer additional robustness, while strategies such as scaling up parameters and using high-quality pre-training data are re-evaluated and found to still help mitigate spurious features.

Abstract

Large vision language models, such as CLIP, demonstrate greater robustness to spurious features than single-modal models trained on ImageNet. However, existing test datasets are typically curated based on ImageNet-trained models, and so aim to capture the spurious features inherited in ImageNet. Benchmarking CLIP models on ImageNet-oriented spurious features may not be sufficient to reflect the extent to which CLIP models are robust to spurious correlations within CLIP training data, e.g., LAION. To this end, we craft a new challenging dataset named CounterAnimal, designed to reveal the reliance of CLIP models on realistic spurious features. Specifically, we split animal photos into groups according to their backgrounds, and then identify a pair of groups for each class across which a CLIP model shows a large performance drop. Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-training data, yet have limited influence on ImageNet models. We provide theoretical insights that the CLIP objective cannot offer additional robustness. Furthermore, we re-evaluate strategies such as scaling up parameters and using high-quality pre-training data. We find that they still help mitigate the spurious features, providing a promising path for future developments.
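The group-pairing step described in the abstract can be illustrated with a minimal sketch. It assumes that per-background zero-shot accuracies for one class are already available, and it simply takes the best and worst backgrounds as the easy and hard groups; the paper's actual selection procedure may differ. The function name `pick_easy_hard` and the "water" entry are hypothetical, while the snow and grass accuracies are the ice bear values reported in Figure 1.

```python
def pick_easy_hard(acc_by_background):
    """Return (easy, hard, drop) for the background pair with the largest accuracy gap.

    acc_by_background maps a background name to zero-shot accuracy (%) for one class.
    """
    if len(acc_by_background) < 2:
        raise ValueError("need at least two background groups")
    # The largest gap between any two groups is always between the max and the min.
    easy = max(acc_by_background, key=acc_by_background.get)
    hard = min(acc_by_background, key=acc_by_background.get)
    drop = acc_by_background[easy] - acc_by_background[hard]
    return easy, hard, drop


# Ice bear accuracies for CLIP-LAION400M-ViT-B/32 (snow and grass are from Figure 1;
# "water" is a hypothetical extra group added only for illustration).
ice_bear = {"snow": 97.62, "grass": 70.91, "water": 85.0}
easy, hard, drop = pick_easy_hard(ice_bear)
print(f"easy={easy}, hard={hard}, drop={drop:.2f} percentage points")
```

Repeating this per class would give one easy/hard pair for each animal, which is the layout summarized in Figure 3.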

Paper Structure

This paper contains 26 sections, 3 theorems, 19 equations, 17 figures, and 22 tables.

Key Result

Theorem 1

Given a multi-modal dataset (Definition 1) with suitable variance in the features $\sigma_{inv}=\Theta(1)>\sigma_{spu}$, and spurious features with a large spurious correlation $p_{spu}=1-o(1)$, an overparameterized CLIP model where $n=\omega(1),d_M=\Omega(n)$ and $d_T=\Omega(n)$, if and a small error in the OOD test data where $a= y$: where $\kappa_1=\frac{\sigma_{inv}^2+2-2\mu_{

Figures (17)

  • Figure 1: We showcase CounterAnimal examples from the class of ice bear, separated into easy and hard groups with different backgrounds (i.e., snow and grass). The zero-shot performance of CLIP-LAION400M-ViT-B/32 drops from 97.62% (easy) to 70.91% (hard).
  • Figure 2: The easy vs. hard performance (%) for CLIP, ImageNet models, and more advanced LVLMs, i.e., MiniGPT4 and LLaVA. The marker size indicates the backbone scale and the color shade indicates pre-train data scale. We highlight the CLIP models pre-trained on high-quality datasets, i.e., DataComp (CLIP-DC) and Data Filtering Networks (CLIP-DFN). We linearly fit the trends for CLIP (CLIP, CLIP-DC, and CLIP-DFN) and ImageNet models to show their effective robustness. We also depict the perfect trend, i.e., $y=x$, where the models will not learn any bias.
  • Figure 3: The data layout across various animal classes. The horizontal axis denotes the class IDs and the vertical axis denotes the number of photos for the easy and hard groups, respectively.
  • Figure 4: The 1 vs. 1000 performance drop (%) with CLIP-LAION400M-ViT-B/32. The horizontal axis denotes the class IDs and the vertical axis denotes the percentage points of decline.
  • Figure 5: The 1 vs. 1000 results for varying CLIP setups beyond CLIP-LAION400M-ViT-B/32: a) fixing the pre-train dataset to be LAION400M and b) fixing the backbone to be ViT-B/32.
  • ...and 12 more figures
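
The linear trend fitting described for Figure 2 can be sketched as an ordinary least-squares fit of hard-group accuracy against easy-group accuracy, compared with the ideal line $y=x$. This is an illustrative sketch only: the helper `fit_line` and the accuracy pairs below are hypothetical and are not values from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more matching x/y values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (easy, hard) accuracy pairs (%) for a family of models.
easy_acc = [80.0, 85.0, 90.0]
hard_acc = [50.0, 57.5, 65.0]

slope, intercept = fit_line(easy_acc, hard_acc)

# Under the ideal trend y = x a model would score the same on both groups,
# so the distance below that line measures how much accuracy is lost on the hard group.
predicted_hard = slope * 90.0 + intercept
gap_to_ideal = 90.0 - predicted_hard
print(f"slope={slope:.2f}, intercept={intercept:.2f}, gap at easy=90: {gap_to_ideal:.2f}")
```

A model family whose fitted line sits closer to $y=x$ loses less accuracy when moving from the easy to the hard group, which is the sense in which the figure compares effective robustness.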

Theorems & Definitions (6)

  • Definition 1: Multi-modal Dataset
  • Theorem 1
  • Definition 2: Multi-modal Dataset
  • Theorem 2: Restatement of Theorem 1
  • Proof
  • Lemma 1: understand_clip_ood