Understanding Bias in Large-Scale Visual Datasets

Boya Zeng, Yida Yin, Zhuang Liu

TL;DR

This paper studies bias in large-scale visual datasets. It introduces a framework that uses image transformations to isolate different kinds of information (semantics, structure, boundary, color, and frequency) and then measures how accurately a classifier can identify each image's source dataset from the transformed data. It combines object-level analysis with open-ended language methods to explain semantic bias, and applies them to YFCC, CC, and DataComp (collectively, YCD). The key findings are that semantic and structural cues are major drivers of dataset bias, that object distributions and scene themes differ across the datasets, and that synthetic data can inherit these biases. The work offers a practical, annotation-free way to diagnose dataset bias and to guide the creation of more diverse and representative visual datasets, with implications for pre-training data selection and dataset curation.

Abstract

A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and build more diverse and representative ones in the future. Our project page and code are available at http://boyazeng.github.io/understand_bias .

Paper Structure

This paper contains 33 sections, 29 figures, 10 tables.

Figures (29)

  • Figure 1: Original images. We sample two images from each of YFCC (Thomee et al., 2016), CC (Changpinyo et al., 2021), and DataComp (Gadre et al., 2023). Dataset classification on the original images has a reference accuracy of 82.0%.
  • Figure 2: Transformations preserving semantic information (semantic segmentation, object detection, and captioning) and potentially reducing low-level signatures (VAE) result in high dataset classification accuracy. This suggests that semantic discrepancy is an important form of dataset bias.
  • Figure 3: Transformations outlining object shapes and estimating pixel depth. Dataset classification achieves even higher accuracies on object contours and depth images than on semantic information, indicating that object shapes and spatial geometry vary significantly across YCD.
  • Figure 4: Transformations breaking spatial structure. Pixel shuffling drastically decreases dataset classification accuracy, but patch shuffling has minimal impact. This demonstrates that local structure is important and sufficient for models to learn the patterns of each dataset.
  • Figure 5: Effect of patch sizes. Dataset classification accuracy approaches the reference accuracy as the patch size increases.
  • ...and 24 more figures
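
The two structure-breaking transformations named in Figure 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pixel_shuffle` and `patch_shuffle`, the nested-list image representation, and the choice to draw a fresh random permutation on every call are all assumptions made for illustration. Pixel shuffling permutes every pixel position, while patch shuffling permutes whole tiles and leaves the contents of each tile intact.

```python
import random


def pixel_shuffle(image, rng):
    """Permute every pixel position at random (hypothetical helper).

    `image` is a list of rows; each entry is one pixel value (e.g. an RGB tuple).
    """
    height, width = len(image), len(image[0])
    pixels = [px for row in image for px in row]
    rng.shuffle(pixels)
    # Re-cut the flat, shuffled pixel list into rows of the original width.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def patch_shuffle(image, patch_size, rng):
    """Permute non-overlapping patch_size x patch_size tiles (hypothetical helper).

    Pixels inside each tile keep their relative positions, so local
    structure survives while the global layout is scrambled.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    # Top-left corner of every tile, in row-major order.
    corners = [(r, c)
               for r in range(0, height, patch_size)
               for c in range(0, width, patch_size)]
    sources = corners[:]
    rng.shuffle(sources)
    out = [[None] * width for _ in range(height)]
    # Copy each source tile into its newly assigned destination slot.
    for (dst_r, dst_c), (src_r, src_c) in zip(corners, sources):
        for dr in range(patch_size):
            for dc in range(patch_size):
                out[dst_r + dr][dst_c + dc] = image[src_r + dr][src_c + dc]
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for row in patch_shuffle(image, 2, rng):
        print(row)
```

In this sketch, a `patch_size` equal to the image side leaves the image unchanged, and a `patch_size` of 1 is an ordinary pixel shuffle. That matches the direction of the trend in Figure 5, where larger patches preserve more of the original image.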