Representation Debiasing of Generated Data Involving Domain Experts

Aditya Bhattacharya, Simone Stumpf, Katrien Verbert

TL;DR

The paper addresses representation bias in ML datasets and the limited effectiveness of automated debiasing by proposing human-in-the-loop approaches that recruit domain experts to steer data augmentation and assess bias effects, aiming to generate more representative training data. It introduces four interaction approaches—Bias awareness, Multivariate constraint planning, Conditional sampling, and What-if exploration—and validates them through a low-fidelity healthcare prototype and an exploratory study with five healthcare experts, emphasizing how domain knowledge can guide constrained generation and validation of synthetic samples. The study demonstrates that domain experts can contribute to detecting and correcting bias in generated data, informing UI components designed to support debiasing and user-centered AI development. The work highlights the potential for improved fairness and data quality in AI systems through ongoing collaboration between domain experts and AI, while acknowledging the need for continual monitoring and broader domain evaluation to sustain debiasing across contexts.

Abstract

Biases in Artificial Intelligence (AI) or Machine Learning (ML) systems due to skewed datasets problematise the application of prediction models in practice. Representation bias is a prevalent form of bias found in the majority of datasets. This bias arises when training data inadequately represents certain segments of the data space, resulting in poor generalisation of prediction models. Despite AI practitioners employing various methods to mitigate representation bias, their effectiveness is often limited due to a lack of thorough domain knowledge. To address this limitation, this paper introduces human-in-the-loop interaction approaches for representation debiasing of generated data involving domain experts. Our work advocates for a controlled data generation process involving domain experts to effectively mitigate the effects of representation bias. We argue that domain experts can leverage their expertise to assess how representation bias affects prediction models. Moreover, our interaction approaches can facilitate domain experts in steering data augmentation algorithms to produce debiased augmented data and validate or refine the generated samples to reduce representation bias. We also discuss how these approaches can be leveraged for designing and developing user-centred AI systems to mitigate the impact of representation bias through effective collaboration between domain experts and AI.

Paper Structure

This paper contains 12 sections and 1 figure.

Figures (1)

  • Figure 1: Design of visual components of a low-fidelity prototype instantiating the interaction approaches for representation debiasing. Bias awareness involves: presenting the overall representation bias measures, presenting the representation bias measures for each predictor variable, showing the representation bias for each sub-category of the selected predictor variable, showing the corresponding impact on the model performance, and highlighting the most impacted variables and sub-categories. Multivariate constraint mapping involves: allowing users to specify the required number of generated samples for a specific target class, and set constraints on the predictor variable values. Conditional sampling involves: allowing users to sample generated data through conditions defined in data filters, and remove misfit samples. What-if exploration involves: allowing users to validate or modify generated samples through "what-if" analysis, and applying the prediction model on generated samples to identify problematic samples having a low prediction confidence level.
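
The steps named in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The paper summary does not define its representation bias measure, so the per-category share of records used here is an assumption, as are the function names, the toy records, the `toy_predict_proba` stand-in model, and the 0.6 confidence threshold.

```python
from collections import Counter


def representation_rates(records, variable):
    """Return the share of records in each sub-category of `variable`.

    Assumption: a simple proportion stands in for the paper's bias measure.
    """
    counts = Counter(record[variable] for record in records)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def conditional_sample(generated, conditions):
    """Keep generated records that satisfy every filter in `conditions`.

    `conditions` maps a variable name to a predicate on its value.
    """
    return [
        record
        for record in generated
        if all(check(record[name]) for name, check in conditions.items())
    ]


def flag_low_confidence(generated, predict_proba, threshold=0.6):
    """Return records whose highest class probability is below `threshold`."""
    return [r for r in generated if max(predict_proba(r)) < threshold]


# Hypothetical toy data, not taken from the paper.
training = [
    {"age_group": "adult", "bmi": 24.0},
    {"age_group": "adult", "bmi": 31.5},
    {"age_group": "adult", "bmi": 27.2},
    {"age_group": "senior", "bmi": 26.1},
]
generated = [
    {"age_group": "senior", "bmi": 25.0},
    {"age_group": "senior", "bmi": 95.0},  # implausible value an expert might filter out
    {"age_group": "adult", "bmi": 22.0},
    {"age_group": "senior", "bmi": 29.5},
]


def toy_predict_proba(record):
    """Hypothetical stand-in for a trained model: less confident near a BMI of 30."""
    confidence = 0.55 if abs(record["bmi"] - 30.0) < 1.0 else 0.9
    return [confidence, 1.0 - confidence]


# Bias awareness: seniors make up a quarter of the training records.
rates = representation_rates(training, "age_group")

# Conditional sampling: keep plausible generated records for the under-represented group.
kept = conditional_sample(
    generated,
    {"age_group": lambda v: v == "senior", "bmi": lambda v: 15.0 <= v <= 50.0},
)

# What-if exploration: surface kept records that the model is unsure about.
to_review = flag_low_confidence(kept, toy_predict_proba)
```

In this sketch the domain expert's knowledge enters through the filter predicates and the review of flagged records; in the prototype described by the caption those choices are made through visual components rather than code.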