A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu, Kaiming He
TL;DR
Despite larger and more diverse pretraining datasets, this study shows that modern neural networks can reliably identify which dataset an image came from (84.7% held-out accuracy on a three-way task over YFCC, CC, and DataComp), indicating that dataset bias persists. The authors systematically vary the datasets, architectures, data regimes, and even self-supervised pretraining to show that this bias is learnable and that the learned features partially transfer to downstream tasks, while low-level cues are not the sole driver. Through pseudo-dataset controls and cross-dataset analyses, they distinguish memorization from genuine generalization and report that combining datasets can mitigate bias. The work urges careful attention to dataset construction and bias when developing and evaluating large-scale pretraining pipelines.
Abstract
We revisit the "dataset classification" experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.
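To make the "dataset classification" setup concrete, the following is a minimal sketch of how such a task could be framed: each image is labeled by the dataset it was drawn from, and a classifier is scored on held-out images. The function names, the use of string IDs in place of images, and the equal per-dataset sampling are all assumptions made for illustration and are not taken from the paper; no model is trained here. Under equal sampling, chance accuracy for a three-way task is 1/3 (about 33.3%), which is the baseline against which the reported 84.7% should be read.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (image identifier, dataset-origin label)


def make_dataset_classification_split(
    sources: Dict[str, Sequence[str]],
    n_train: int,
    n_val: int,
    seed: int = 0,
) -> Tuple[List[Example], List[Example], List[str]]:
    """Label every item by its source dataset and build train/val splits.

    Hypothetical helper: samples the same number of items from each source
    (an assumption) so that chance accuracy is 1 / number of sources.
    """
    rng = random.Random(seed)
    names = sorted(sources)  # fixed order, so label i always means names[i]
    train: List[Example] = []
    val: List[Example] = []
    for label, name in enumerate(names):
        items = list(sources[name])
        if len(items) < n_train + n_val:
            raise ValueError(f"source {name!r} has too few items")
        rng.shuffle(items)
        # Disjoint slices keep the validation items held out from training.
        train += [(item, label) for item in items[:n_train]]
        val += [(item, label) for item in items[n_train:n_train + n_val]]
    rng.shuffle(train)
    return train, val, names


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the dataset-origin labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def chance_accuracy(num_datasets: int) -> float:
    """Expected accuracy of random guessing on a balanced task."""
    return 1.0 / num_datasets
```

In an actual experiment the string IDs would be replaced by images, and any image classifier would be trained on the training split and scored with `accuracy` on the validation split; the gap between that score and `chance_accuracy(3)` is what signals dataset bias.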
