FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Eric Slyman; Stefan Lee; Scott Cohen; Kushal Kafle

TL;DR

The paper investigates how semantic deduplication of web-scale vision-language pretraining data affects fairness and bias in CLIP-style models trained on LAION-400M. It introduces FairDeDup, a simple, scalable extension of SemDeDup that uses user-defined sensitive-concept prototypes to bias sample preservation toward underrepresented groups, aiming to improve demographic representation without harming task performance. Empirically, FairDeDup yields better fairness metrics than SemDeDup on FACET and FairFace while keeping zero-shot and retrieval performance comparable to the full-data and SemDeDup baselines; the deduplicated subsets also retain more samples from minority groups. The work offers a practical baseline for fairness-aware data pruning in large-scale vision-language pipelines and discusses its limitations.

Abstract

Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.

Paper Structure (19 sections, 4 equations, 8 figures, 10 tables)

Introduction
Related Work
FairDeDup: Fair Semantic Deduplication
Preliminaries: SemDeDup
FairDeDup
Experiments
Models and Training
Datasets and Metrics
Results
Discussion
Limitations
Conclusion
Bias Constrained Clusters
Hyperparameters
Choosing Sensitive Concepts
...and 4 more sections

Figures (8)

Figure 1: Training models on deduplicated data can yield similar results to the full-data setting on standard tasks like zero-shot ImageNet (Deng et al., 2009) classification (left, higher is better $\uparrow$). However, impacts on subgroup performance have not been studied. We discover cases such as gender disparity (right, lower is better $\downarrow$) where deduplication reinforces existing biases on FACET (Gustafson et al., 2023). FairDeDup preserves performance while reducing bias from deduplication and, in some cases, w.r.t. the full-data setting.
Figure 2: The semantic deduplication pipeline following three clusters with two subgroups. Connected shapes are duplicates. We (1) embed all images from the dataset with a pretrained model then partition with $k$-means to enable efficient search during (2) deduplication. We make a simple modification to the maximum distance selection heuristic used by Abbas et al. (2023) (left) to improve subgroup diversity by preserving samples which maximize similarity to poorly represented sensitive concepts according to user-specified concept prototypes (right).
Figure 3: PyTorch-style pseudo-code for FairDeDup selection given concept prototypes, within-cluster embeddings, and an eps similarity threshold for determining neighborhoods. We omit the base case, in which the first sample selected within a cluster is the one with the highest average concept prototype similarity.
Figure 4: A random sampling of preserved samples from a cluster primarily composed of medical professionals after deduplication. FairDeDup improves selection diversity featuring increased variability in age, skin tone, and gender presentation.
Figure 5: Sample clusters in which selecting for certain underrepresented concepts may be difficult because those concepts have been split into an entirely different cluster.
...and 3 more figures
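
The selection step described in the captions of Figures 2 and 3 can be illustrated with a small, dependency-free sketch. This is not the authors' pseudo-code, which is not reproduced on this page; it is one possible reading of the captions under several assumptions: embeddings and prototypes are unit-norm, two samples are treated as duplicates when their cosine similarity is at least `1 - eps`, the "poorly represented" concept is tracked as a running sum of prototype similarities over kept samples, and the base case is applied to the first neighborhood visited. The names `dot`, `fair_select`, and `concept_totals` are hypothetical.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-norm vectors."""
    return sum(a * b for a, b in zip(u, v))


def fair_select(cluster_embs, prototypes, eps):
    """Return indices of the samples kept from one k-means cluster.

    cluster_embs: unit-norm embedding vectors of the samples in the cluster.
    prototypes:   unit-norm vectors for user-specified sensitive concepts.
    eps:          samples with similarity >= 1 - eps are treated as
                  duplicates (assumed threshold convention).
    """
    n_concepts = len(prototypes)
    # Similarity of every sample to every concept prototype.
    proto_sims = [[dot(e, p) for p in prototypes] for e in cluster_embs]
    # Assumed bookkeeping: summed prototype similarity of kept samples.
    concept_totals = [0.0] * n_concepts
    unvisited = set(range(len(cluster_embs)))
    kept = []
    while unvisited:
        # Take an unvisited sample and gather its unvisited near-duplicates.
        node = min(unvisited)
        neighbors = [node] + [
            j for j in sorted(unvisited)
            if j != node and dot(cluster_embs[node], cluster_embs[j]) >= 1.0 - eps
        ]
        if not kept:
            # Base case: highest average concept-prototype similarity.
            best = max(neighbors, key=lambda j: sum(proto_sims[j]) / n_concepts)
        else:
            # Keep the duplicate most similar to the least represented concept.
            weakest = min(range(n_concepts), key=lambda c: concept_totals[c])
            best = max(neighbors, key=lambda j: proto_sims[j][weakest])
        kept.append(best)
        for c in range(n_concepts):
            concept_totals[c] += proto_sims[best][c]
        # Every sample in the neighborhood is now accounted for.
        unvisited.difference_update(neighbors)
    return kept


# Toy cluster: samples 0 and 1 are near-duplicates, as are samples 2 and 3.
embs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
protos = [[1.0, 0.0], [0.0, 1.0]]
print(fair_select(embs, protos, eps=0.25))  # [1, 3]
```

In the toy run, the first neighborhood keeps sample 1 through the base case; the second concept is then the less represented one, so sample 3 is preferred over sample 2. A production version would vectorize the similarity computations, as the PyTorch-style pseudo-code in Figure 3 suggests.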
