Synthetic Data Generation for Intersectional Fairness by Leveraging Hierarchical Group Structure
Gaurav Maheshwari, Aurélien Bellet, Pascal Denis, Mikaela Keller
TL;DR
This work tackles intersectional fairness by observing that groups defined by sensitive attributes form a hierarchy of intersections, so underrepresented subgroups can be augmented using data from their parent groups. It introduces a modality-agnostic data generation mechanism that synthesizes target-group data by combining parent-group samples. The generator is trained by minimizing a loss based on Maximum Mean Discrepancy (MMD), $L_{\mathbf{g},k}(\theta)$, so that the generated data closely approximates the target distribution $\mathcal{D}_{\mathbf{g}|Y=k}$ for group $\mathbf{g}$ and class label $k$. Empirical results on four diverse datasets spanning text and images show that classifiers trained on the augmented data achieve stronger intersectional fairness than several baselines, without consistently sacrificing overall performance and without exhibiting leveling down. The method is simple, scalable, and adaptable to other distributional divergences, offering a practical way to improve fairness in real-world deployments across modalities.
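To make the MMD-based objective more concrete, the following is a minimal sketch of a biased estimate of squared MMD between two sample sets. It is not the authors' implementation: the Gaussian (RBF) kernel, the `bandwidth` value, the function names, and the synthetic data are all assumptions chosen for illustration, and the summary above does not specify which kernel or estimator the paper uses.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb rounding error.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical stand-ins: "target" plays the role of real samples from the
# target group, "close" and "far" play the role of two candidate generated sets.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 5))
close = rng.normal(0.0, 1.0, size=(200, 5))  # same distribution as target
far = rng.normal(2.0, 1.0, size=(200, 5))    # shifted distribution

mmd_close = mmd_squared(close, target)
mmd_far = mmd_squared(far, target)
```

In this sketch, a generator whose outputs resemble `close` would receive a lower loss than one whose outputs resemble `far`, which is the sense in which minimizing an MMD-style loss pulls generated data toward the target distribution.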
Abstract
In this paper, we introduce a data augmentation approach specifically tailored to enhance intersectional fairness in classification tasks. Our method capitalizes on the hierarchical structure inherent to intersectionality by viewing groups as intersections of their parent categories. This perspective allows us to augment data for smaller groups by learning a transformation function that combines data from these parent groups. Our empirical analysis, conducted on four diverse datasets including both text and images, reveals that classifiers trained with this data augmentation approach achieve superior intersectional fairness and are more robust to "leveling down" when compared to methods optimizing traditional group fairness metrics.
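The idea of a group being the intersection of its parent categories can be sketched as follows. This is an illustrative encoding only, under stated assumptions: a group is represented as a tuple of sensitive attribute values, a parent is obtained by leaving exactly one attribute unspecified, and the attribute values, the `WILDCARD` marker, and the function names are hypothetical rather than taken from the paper.

```python
WILDCARD = None  # hypothetical marker for "attribute left unspecified"


def parent_groups(group):
    """Return the groups obtained by leaving exactly one attribute unspecified."""
    parents = []
    for i, value in enumerate(group):
        if value is WILDCARD:
            continue  # already unspecified, nothing to relax here
        parents.append(group[:i] + (WILDCARD,) + group[i + 1:])
    return parents


def belongs_to(sample_attrs, group):
    """Check whether a sample's attribute tuple matches a (possibly partial) group."""
    return all(g is WILDCARD or g == s for s, g in zip(sample_attrs, group))


# Hypothetical three-attribute group, for illustration only.
group = ("female", "black", "under_45")
parents = parent_groups(group)

# A sample that differs from the target group in one attribute still falls
# inside one of the parents, so it is available as augmentation material.
sample = ("male", "black", "under_45")
usable_parents = [p for p in parents if belongs_to(sample, p)]
```

Every member of the target group belongs to all of its parents, while each parent also contains samples from sibling groups. That larger pool is what makes parent groups a plausible data source for a small intersectional group.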
