A Decade's Battle on Dataset Bias: Are We There Yet?

Zhuang Liu, Kaiming He

TL;DR

Despite larger and more diverse pretraining data, the study shows neural networks can reliably identify the originating dataset of images, indicating persistent dataset bias. The authors systematically vary datasets, architectures, data regimes, and even self-supervised pretraining to demonstrate that bias can be learned and is partially transferable to downstream tasks, while low-level cues are not the sole driver. Through pseudo-dataset controls and cross-dataset analyses, they distinguish memorization from genuine generalization and reveal that combining datasets can mitigate bias. The work urges careful consideration of dataset construction and bias in the development and evaluation of large-scale pretraining pipelines.

Abstract

We revisit the "dataset classification" experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.
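
The dataset classification setup described above can be sketched in a few lines. This is only an illustrative outline under stated assumptions, not the authors' code: the helper names (`build_examples`, `split`, `accuracy`, `chance_accuracy`) and the string image IDs are hypothetical. It shows how each image is labeled with the index of its source dataset, how a held-out validation split is set aside, and what chance-level accuracy looks like for a balanced three-way problem.

```python
import random
from collections import Counter

# The three sources named in the abstract; the label is the list index.
DATASETS = ["YFCC", "CC", "DataComp"]


def build_examples(images_by_dataset):
    """Pair every image ID with the index of the dataset it came from."""
    examples = []
    for label, name in enumerate(DATASETS):
        for image_id in images_by_dataset[name]:
            examples.append((image_id, label))
    return examples


def split(examples, val_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of examples for validation."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true source-dataset label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # Hypothetical image IDs standing in for real images.
    images = {name: [f"{name}_{i}" for i in range(10)] for name in DATASETS}
    train, val = split(build_examples(images))
    all_labels = [y for _, y in train + val]
    print(len(train), len(val), chance_accuracy(all_labels))
```

With equally sized sources, always guessing one dataset gives one-third accuracy, which is the baseline against which the reported 84.7% should be read. A real experiment would replace the image IDs with pixel data and train a network to produce the predictions.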

Paper Structure

This paper contains 35 sections, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: The "Name That Dataset" game (Torralba & Efros, 2011) in 2024: These images are sampled from three modern datasets: YFCC (Thomee et al., 2016), CC (Changpinyo et al., 2021), and DataComp (Gadre et al., 2023). Can you specify which dataset each image is from? While these datasets appear to be less biased, we discover that neural networks can easily accomplish this "dataset classification" task with surprisingly high accuracy on the held-out validation set. Answer: YFCC: 1, 4, 7, 10, 13, 16, 19; CC: 2, 5, 8, 11, 14, 17, 20; DataComp: 3, 6, 9, 12, 15, 18, 21
  • Figure 2: Models of different sizes all achieve very high accuracy, even when they are substantially smaller than typical modern networks. Here the models are variants of ConvNeXt (Liu et al., 2022), whose "Tiny" size has 27M parameters. Results are on the YCD combination (YFCC, CC, DataComp) with 1M training images from each set.
  • Figure 3: Dataset classification accuracy increases with the number of training images. This behavior suggests that the model is learning certain patterns that are generalizable, which resembles the behavior observed in typical semantic classification tasks. Results are on YCD, with each model trained for the same number of iterations.
  • Figure 4: Different corruptions for suppressing low-level signatures. We apply a certain type of corruption to both the training and validation sets, on which we train and evaluate our model.
  • Figure 5: User study results on humans performing the dataset classification task. Humans generally categorize images from YCD with 40-60% accuracy.
  • ...and 2 more figures