Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?

Prasanna Mayilvahanan, Thaddäus Wiedemer, Evgenia Rusak, Matthias Bethge, Wieland Brendel

Abstract

Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training datasets (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP, as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION ($\frac{1}{4}$ of its original size) on which CLIP can be trained to match its original OOD performance.

Paper Structure (38 sections, 4 equations, 26 figures, 7 tables)

Figures (26)

  • Figure 1: Similarity of common benchmarks to LAION-400M and ImageNet-Train. We show nearest neighbors of ImageNet-Sketch, ImageNet-R, and ImageNet-Val samples in LAION-400M and ImageNet-Train, ordered by decreasing perceptual similarity. We omit duplicates within these nearest neighbors. Perceptual similarity is cosine similarity computed in CLIP's image embedding space (see Sec. \ref{sec:difficulty}) and can be thought of as measuring the perceptual closeness of images in terms of content and style. LAION-400M clearly contains more images similar to samples from ImageNet-Sketch and ImageNet-R; in contrast, ImageNet-Train is more similar to ImageNet-Val. More details in App. \ref{sec:appendix-nn_vis}.
  • Figure 2: Relation between perceptual similarity and visual closeness of nearest neighbors. Query images are sampled from ImageNet-Sketch (top row) and are connected to their nearest neighbor in LAION-400M (bottom row). As in Fig. \ref{fig:motivation}, perceptual similarity is simply the cosine similarity measured in CLIP ViT-B/16+'s image embedding space.
  • Figure 3: Nearest-neighbor similarity is predictive of performance. Left: LAION-400M-trained CLIP's top-1 classification accuracy on test samples is highly correlated with their nearest-neighbor similarity $s_{\text{test},i}$. Results are averaged over 0.05 similarity intervals. Center and right: Similarity-based pruning greatly impacts CLIP's top-1 classification accuracy. We train a baseline model on LAION-200M (see Sec. \ref{sec:exp_details}) and additional models on LAION-200M splits created by random pruning, near-pruning (in order of decreasing similarity), and far-pruning (in order of increasing similarity). Compared to training on 'rand-pruned' splits (solid blue curve), training on 'near-pruned' splits (solid red curve) drastically decreases classification accuracy. Training on 'far-pruned' splits (dashed blue curve) impacts accuracy comparatively little.
  • Figure 4: Nearest-neighbor similarity distribution differs between LAION-400M and ImageNet-Train. The histograms display the similarity $s_{\text{test},i}$ of samples in ImageNet-Sketch (left), ImageNet-R (center), and ImageNet-Val (right) to their nearest neighbors in LAION-400M (red) and ImageNet-Train (blue). ImageNet-Sketch and ImageNet-R are overall more similar to LAION-400M, while ImageNet-Train is more similar to ImageNet-Val.
  • Figure 5: Aligning the similarity gap of two datasets. A larger, denser, more diverse dataset likely contains samples more similar to given test points than a smaller, sparser one. To control for this, we compute the nearest-neighbor similarity of each test point to the smaller dataset (left) and prune points from the larger dataset that lie within this hull (center). We end up with a corrected large dataset replicating the similarity gap of the small one (right).
  • ...and 21 more figures
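
The captions of Figures 1, 3, and 5 describe two operations: measuring the nearest-neighbor similarity $s_{\text{test},i}$ of a test sample as a cosine similarity in an embedding space, and pruning a large dataset so that it replicates the similarity gap of a smaller one. The following is a minimal sketch of both, not the authors' implementation. It assumes embeddings are already available as plain lists of floats (in the paper they come from CLIP's image encoder), and it reads "lie within this hull" as "more similar to a test point than that test point's nearest neighbor in the small dataset"; the function names and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor_similarity(test_point: Vector, dataset: Sequence[Vector]) -> float:
    """Similarity of a test embedding to its most similar sample in `dataset`."""
    return max(cosine_similarity(test_point, x) for x in dataset)


def prune_to_similarity_gap(
    large: Sequence[Vector],
    small: Sequence[Vector],
    test: Sequence[Vector],
) -> List[int]:
    """Return the indices of `large` that survive pruning.

    Assumption (our reading of the Figure 5 caption): a sample of `large`
    is removed if it is strictly more similar to some test point than that
    test point's nearest neighbor in `small`.
    """
    # One threshold per test point: its nearest-neighbor similarity to `small`.
    thresholds = [nearest_neighbor_similarity(q, small) for q in test]
    kept = []
    for i, x in enumerate(large):
        too_close = any(
            cosine_similarity(x, q) > s for q, s in zip(test, thresholds)
        )
        if not too_close:
            kept.append(i)
    return kept


# Toy 2-D "embeddings" standing in for real image embeddings.
test_set = [(1.0, 0.0)]
small_set = [(1.0, 1.0), (0.0, 1.0)]
large_set = [(1.0, 0.1), (1.0, 2.0), (0.0, 1.0), (-1.0, 0.0)]

kept = prune_to_similarity_gap(large_set, small_set, test_set)
print(kept)  # [1, 2, 3]: the first sample is closer to the test point than anything in small_set
```

This brute-force version compares every sample of the large dataset with every test point, which is only practical for small inputs; at the scale of LAION-400M an approximate nearest-neighbor index would be needed instead.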