Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws
Akshita Jha, Sanchit Kabra, Chandan K. Reddy
TL;DR
The paper addresses the conflation of societal bias with task-specific reading comprehension flaws in generative language models. It distinguishes bias from flaws, builds an evaluation framework on BBQ and SQuAD-v2, and introduces an instruction-tuning-based mitigation that uses general-purpose data to implicitly reduce stereotypical outputs while preserving model utility. Across multiple models and dimensions, the approach reduces stereotypical bias substantially (by over 60%) and improves handling of underinformative contexts; ablations show the value of ambiguous-context data and consistent instructions. The work offers a practical, model-agnostic mitigation pathway and shows why bias must be disentangled from flaws to target fairness improvements in downstream tasks.
Abstract
Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly reduces observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions (including nationality, age, gender, disability, and physical appearance) by addressing comprehension-based failures, without relying on explicit debiasing techniques. We evaluate several state-of-the-art generative models to demonstrate the effectiveness of our approach while maintaining overall utility. Our findings highlight the need to critically disentangle the concept of 'bias' from other types of errors to build more targeted and effective mitigation strategies. CONTENT WARNING: Some examples contain offensive stereotypes.
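To make the bias-versus-flaw distinction concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes a BBQ-style ambiguous-context question whose correct answer is an "unknown" option, and it separates wrong answers that align with the stereotype from wrong answers that do not. The function names, the three labels, and the toy label lists are all hypothetical and chosen for illustration; the relative-reduction formula is only one plausible reading of a statement such as "reduced by over 60%".

```python
from collections import Counter


def categorize(answer: str, stereotyped_option: str,
               unknown_option: str = "unknown") -> str:
    """Label one response to an ambiguous-context question (hypothetical scheme)."""
    normalized = answer.strip().lower()
    if normalized == unknown_option.strip().lower():
        return "correct"        # the model abstains, as the context requires
    if normalized == stereotyped_option.strip().lower():
        return "stereotypical"  # the error aligns with the stereotype
    return "other_error"        # the error does not align with the stereotype


def stereotype_rate(labels: list[str]) -> float:
    """Fraction of responses labeled 'stereotypical'."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["stereotypical"] / total if total else 0.0


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in a rate, e.g. 0.5 -> 0.2 is a 60% reduction."""
    if before <= 0:
        raise ValueError("the 'before' rate must be positive")
    return (before - after) / before


# Toy labels, invented for illustration only (not results from the paper).
labels_before = ["stereotypical"] * 5 + ["correct"] * 5
labels_after = ["stereotypical"] * 2 + ["correct"] * 8
reduction = relative_reduction(stereotype_rate(labels_before),
                               stereotype_rate(labels_after))
print(f"relative reduction: {reduction:.0%}")
```

Keeping "other_error" as its own label is the point of the sketch: a high rate of non-stereotypical errors on ambiguous contexts suggests a comprehension flaw, not bias, which is the distinction the abstract argues for.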
