Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities

Mahdi Erfanian, H. V. Jagadish, Abolfazl Asudeh

TL;DR

Chameleon tackles fairness in multi-modal data by minimally augmenting datasets with synthetically generated tuples from foundation models, guided by a rejection-sampling framework that preserves the underlying distribution and data quality. It formalizes data coverage via maximal-uncovered patterns and proves the Combination-Selection problem is NP-hard, offering a greedy approximation and several guide-selection strategies, including a LinUCB-based contextual bandit approach. Empirical results on UTKFace and FERETDB demonstrate that repairing datasets with Chameleon reduces downstream unfairness for under-represented groups, with a modest price in overall performance and generation cost. The work establishes a practical, scalable pathway for fairness-aware data augmentation and points to future extensions to natural language and graph data domains.
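The coverage notion mentioned above can be made concrete with a small sketch. It assumes the usual pattern formulation from the data-coverage literature, which this summary does not spell out: a pattern fixes some attributes and leaves others as a wildcard, it is uncovered when fewer than a threshold number of tuples match it, and it is maximal when every one-step generalization of it is covered. The names `WILDCARD`, `coverage`, `parents`, and `maximal_uncovered_patterns` are hypothetical, and the brute-force enumeration is for illustration only, not the paper's algorithm.

```python
from itertools import product

WILDCARD = "X"  # hypothetical marker for an unspecified attribute


def matches(pattern, row):
    """True if the row agrees with the pattern on every fixed attribute."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, row))


def coverage(pattern, rows):
    """Number of rows that match the pattern."""
    return sum(matches(pattern, row) for row in rows)


def parents(pattern):
    """Generalizations obtained by replacing one fixed value with the wildcard."""
    for i, p in enumerate(pattern):
        if p != WILDCARD:
            yield pattern[:i] + (WILDCARD,) + pattern[i + 1:]


def maximal_uncovered_patterns(rows, domains, threshold):
    """Brute-force search: uncovered patterns whose parents are all covered."""
    mups = []
    for pattern in product(*[(WILDCARD, *d) for d in domains]):
        if coverage(pattern, rows) >= threshold:
            continue  # covered, so not of interest
        if all(coverage(q, rows) >= threshold for q in parents(pattern)):
            mups.append(pattern)
    return mups


# Toy data set with two categorical attributes (values are placeholders).
domains = [("F", "M"), ("A", "B")]
rows = [("M", "A")] * 3 + [("M", "B")] * 3 + [("F", "A")] * 2

print(maximal_uncovered_patterns(rows, domains, threshold=2))
print(maximal_uncovered_patterns(rows, domains, threshold=3))
```

With a threshold of 2, only the fully specified combination with no tuples is reported. With a threshold of 3, the more general pattern that fixes only the first attribute becomes uncovered and subsumes its children, which is why only maximal patterns need to be repaired.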

Abstract

The potential harm of under-representing minorities in training data, particularly in multi-modal settings, is a well-recognized concern. While there has been extensive effort in detecting such under-representation, resolution has remained a challenge. With recent advancements in generative AI, large language models and foundation models have emerged as versatile tools across various domains. In this paper, we propose Chameleon, a system that efficiently utilizes these tools to augment a data set with a minimal addition of synthetically generated tuples, in order to enhance the coverage of the under-represented groups. Our system follows a rejection sampling approach to ensure the generated tuples are of high quality and follow the underlying distribution. In order to minimize the rejection chance of the generated tuples, we propose multiple strategies for providing a guide for the foundation model. Our experimental results, in addition to confirming the efficiency of our proposed algorithms, illustrate the effectiveness of our approach, as the unfairness of the model in a downstream task dropped significantly after data repair using Chameleon.
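The rejection-sampling loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not Chameleon's implementation: `generate` stands in for a call to a foundation model, `accept` stands in for the quality and distribution checks, and `repair_by_rejection` and `max_attempts` are hypothetical names introduced here.

```python
import random


def repair_by_rejection(generate, accept, needed, max_attempts=1000):
    """Collect `needed` accepted candidates, discarding rejected ones.

    `generate` and `accept` are caller-supplied callables (assumptions):
    one produces a synthetic tuple, the other decides whether to keep it.
    Returns the accepted candidates and the number of generation attempts.
    """
    accepted, attempts = [], 0
    while len(accepted) < needed and attempts < max_attempts:
        candidate = generate()
        attempts += 1  # every attempt costs one generation call
        if accept(candidate):
            accepted.append(candidate)
    return accepted, attempts


# Stand-in generator: each "tuple" is just a quality score in [0, 1).
rng = random.Random(0)
samples, cost = repair_by_rejection(
    generate=rng.random,
    accept=lambda score: score >= 0.5,  # placeholder acceptance test
    needed=3,
)
print(len(samples), cost)
```

The attempt counter makes the cost of rejections visible: the fewer candidates are rejected, the fewer generation calls are needed, which is the motivation the abstract gives for guiding the foundation model.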

Paper Structure

This paper contains 43 sections, 2 theorems, 21 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Lemma 1

Combination-Selection is NP-hard. Proofs are provided in the paper's proofs section.

Figures (6)

  • Figure 1: Architecture of Chameleon for image generation
  • Figure 2: Illustration of (a) a guide image and various masks: (b) accurate, (c) moderate, (d) imprecise
  • Figure 3: Sample images from FERETDB (a-d) and UTKFace (e-h) data sets
  • Figure 4: Unfairness (disparate performance) reduction for the uncovered groups in the FERETDB data set after data repair using Chameleon, along with the price of fairness (overall performance reduction).
  • Figure 5: Sample Form Page Given to Participants
  • ...and 1 more figure

Theorems & Definitions (2)

  • Lemma 1
  • Theorem 1