Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Dyah Adila; Shuai Zhang; Boran Han; Yuyang Wang

TL;DR

This work tackles the instability of QA performance in foundation models that arises from superficial prompt variations by introducing SteerFair, an unsupervised, inference-time debiasing method that operates in the model's latent space. It constructs bias demonstrations from unlabeled samples to identify simple association rules, extracts corresponding bias directions via PCA, and steers activations away from these directions during inference using a cautious, multi-head intervention based on QR-combined directions. Empirically, SteerFair substantially reduces option-order bias across multiple tasks and models, often surpassing a supervised baseline with 100 labels and matching one with 500 labels, while also showing generalization across datasets of the same task. The approach offers robust, data-efficient bias mitigation with limited hyperparameter reliance, and situates itself among latent-space knowledge extraction and attention-modification methods as a practical, unsupervised intervention framework.
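The pipeline summarized above (PCA to extract a bias direction, QR to combine several directions, then a shift of activations away from the result) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, activations are treated as plain vectors for a single attention head, averaging the QR basis vectors is one assumed way to merge them, and `alpha` stands for the steering strength shown in Figure 4. The sign of a principal component is arbitrary, so a real implementation would need a rule to orient each direction; the sketch does not address that.

```python
import numpy as np

def first_principal_direction(acts):
    """Return the top principal component (unit vector) of activation rows."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def combine_directions(directions):
    """Orthonormalize directions with QR, then average the basis vectors.

    Averaging is an assumption made for this sketch.
    """
    stacked = np.stack(directions, axis=1)   # shape (d, k)
    q, _ = np.linalg.qr(stacked)             # columns of q are orthonormal
    combined = q.mean(axis=1)
    return combined / np.linalg.norm(combined)

def steer_away(activation, direction, alpha):
    """Shift an activation by alpha against a unit-norm bias direction."""
    return activation - alpha * direction
```

For a unit-norm direction, `steer_away` lowers the activation's projection onto that direction by exactly `alpha` and leaves every orthogonal component unchanged.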

Abstract

The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. We then construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces the performance variance of instruction-tuned models across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86 accuracy points and 12.95 score points, and matches the performance of that baseline with 500 labels.
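The variance the abstract refers to can be made concrete by scoring the same questions under each prompt ordering and comparing the accuracies. The sketch below is an assumed illustration, not the paper's evaluation code: the function names, the dictionary layout keyed by ordering name, and the choice of population standard deviation are all hypothetical.

```python
from statistics import mean, pstdev

def accuracy(predictions, golds):
    """Percentage of predictions that match the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)

def order_sensitivity(preds_by_order, golds_by_order):
    """Mean and population std of accuracy across prompt orderings."""
    accs = [
        accuracy(preds_by_order[name], golds_by_order[name])
        for name in preds_by_order
    ]
    return mean(accs), pstdev(accs)

# Toy example: a model that leans towards option "A" in both orderings.
preds = {"original": ["A", "B", "A", "A"], "swapped": ["A", "A", "A", "A"]}
golds = {"original": ["A", "B", "A", "B"], "swapped": ["B", "A", "B", "A"]}
avg_acc, std_acc = order_sensitivity(preds, golds)  # (62.5, 12.5)
```

A debiasing method of the kind described here would aim to shrink the second value while keeping the first roughly unchanged.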

Paper Structure (47 sections, 11 equations, 10 figures, 6 tables, 2 algorithms)

Introduction
Preliminaries
Problem Statement
Model Architecture
$\textsc{SteerFair}$: Unsupervised Inference-Time Debiasing
Enumerating Bias Association Rules
Constructing Bias Demonstrations
Identifying Bias Directions from Demonstrations
Combining Multiple Bias Directions
Shifting Activation during Inference
Selecting Attention Heads to Intervene
Empirical Evaluation
Baselines.
Experimental Setup.
Mitigating Order Bias
...and 32 more sections

Figures (10)

Figure 1: Top: Model predictions are sensitive to prompt order changes. Bottom: Performance of instruction-tuned models on (1) ScienceQA (2 options) in the original order and with golden answers moved to A/B, and (2) Visual Genome Relation (VGR) with prompt variations using "yes/no" and "no/yes".
Figure 2: $\textsc{SteerFair}$ finds bias directions $\tilde{\textbf{v}}_{h,l}$ (top) and steers attention head values (bottom) away from them during inference.
Figure 3: Left to right: ScienceQA, VGR, MME. $\textsc{SteerFair}$ reduces standard deviation across prompt ordering while maintaining average accuracy.
Figure 4: Effect of the steering strength $\alpha$ (x-axis) and the number of intervened attention heads $K$ (y-axis). Left: Acc%; Right: Std%. Performance is recorded on the VGR dataset.
Figure 5: Kernel density estimate plots of $\textsc{SteerFair}$-identified bias directions on the VGR dataset, projected onto the first 2 PCs.
...and 5 more figures
