On Optimal Steering to Achieve Exact Fairness

Mohit Sharma, Amit Jayant Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah

TL;DR

An optimization program for optimal steering is formulated by finding the nearest ideal distribution in KL-divergence, and efficient algorithms for it are provided when the underlying distributions come from well-known parametric families (e.g., normal, log-normal).

Abstract

To fix the 'bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to ideal ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as ideal if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity); in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest ideal distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that the steering works equally well across different groups.
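
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It shows the standard closed-form KL-divergence between two univariate normal distributions and the affine map that carries samples from one normal to another. The function names `kl_normal` and `affine_steer` are hypothetical, the univariate setting is a simplification, and the direction of the KL-divergence used in the paper's optimization program is not stated here, so the argument order below is an assumption.

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) in nats (closed form).

    Assumption: which of the two distributions plays the role of the
    'ideal' one in the paper's objective is not specified in this page.
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


def affine_steer(x, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Affine map sending N(mu_src, sigma_src^2) to N(mu_tgt, sigma_tgt^2).

    Standardize x under the source distribution, then rescale and shift
    to the target mean and standard deviation.
    """
    return mu_tgt + (sigma_tgt / sigma_src) * (x - mu_src)


if __name__ == "__main__":
    # Cost of moving a subgroup from N(1, 1) to N(0, 1), measured in KL.
    print(kl_normal(1.0, 1.0, 0.0, 1.0))
    # A point one source standard deviation above the source mean lands
    # one target standard deviation above the target mean.
    print(affine_steer(3.0, 1.0, 2.0, 0.0, 1.0))
```

In this simplified picture, a steering method would choose target parameters that satisfy a fairness condition while keeping the summed KL cost over subgroups small; how the paper selects those targets is given by its propositions, not by this sketch.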

Paper Structure

This paper contains 19 sections, 15 theorems, 76 equations, 8 figures.

Key Result

Proposition 3.2

Let $(X, Y, A)$ denote the features, class label, and group membership, respectively, of a random data point from any data distribution $D$ with $q_{ia} = \mathrm{Pr}\left(Y=i, A=a\right)$, for $i \in \mathcal{Y}$ and $a \in \mathcal{A}$. Let $X|Y=i, A=a \sim \mathcal{N}(\mu_{ia}, \Sigma_{ia})$ be m then the group-aware Bayes optimal classifier on $D$ satisfies equal opportunity.

Figures (8)

  • Figure 1: Comparison of different interventions for changing Data Distributions for Exact Fairness. Figure (\ref{subfig:intro_orig}) captures the original distribution, its Bayes error (BE), and the unfairness differences ($\Delta$DP and $\Delta$EO). In Figure (\ref{subfig:intro_affirmative}), we only change the under-privileged group using Corollary \ref{corr:affirmative_uni}, and in Figure (\ref{subfig:intro_all}) we change all four subgroups using Proposition \ref{prop:feature-shift-KL-normal}. Finally, in Figure (\ref{subfig:intro_mean}), we match the means of the two groups. Figures (\ref{subfig:intro_affirmative}) and (\ref{subfig:intro_all}) show that it is possible to construct 'ideal' distributions that are close to the given distribution where both the BE and $\Delta$DP/$\Delta$EO are small.
  • Figure 2: Comparison of different interventions when the $\Delta$DP on the original distribution is high. In this case, EF-All manages to stay close to the true distribution and achieves perfect fairness and error rate, while others deviate significantly.
  • Figure 3: TPR-gap between Gender groups for all professions. All methods to steer feature representations achieve roughly the same accuracy (in the range of 0.77-0.79). Our intervention (EF Affirmative) is able to significantly reduce the TPR-gap for all professions. In many cases, it is even comparable to or better than previous interventions \cite{belrose2023leace, singh2024representation}.
  • Figure 4: Change in Joyful score ($\Delta$-Joyful) before and after adjusting the steering. Our intervention ('EF Affirmative') pushes up the effectiveness of steering the 'Horror' group, relative to the 'Comedy' group.
  • Figure 5: Comparison of different interventions when the subgroup distributions are shifted versions of each other. While all methods achieve the same Bayes Error, Affirmative action is able to bring down the Bayes Error and achieve exact fairness.
  • ...and 3 more figures
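
The captions above report unfairness as $\Delta$DP, $\Delta$EO, and a TPR-gap. As a rough illustration of how such gaps are commonly computed for a binary classifier and two groups, here is a minimal sketch. The function name `fairness_gaps` is hypothetical, and the definitions used (absolute difference in positive-prediction rates for $\Delta$DP, absolute difference in true-positive rates for $\Delta$EO) are the standard ones, which are assumed rather than confirmed to match the paper's exact definitions, especially in the multi-class setting.

```python
def _mean(values):
    """Average of an iterable; fails loudly on an empty subgroup."""
    values = list(values)
    if not values:
        raise ValueError("empty subgroup: rate is undefined")
    return sum(values) / len(values)


def fairness_gaps(y_true, y_pred, group):
    """Return (delta_dp, delta_eo) for binary labels and exactly two groups.

    delta_dp: |Pr(pred=1 | A=a) - Pr(pred=1 | A=b)|
    delta_eo: |Pr(pred=1 | Y=1, A=a) - Pr(pred=1 | Y=1, A=b)|  (TPR gap)
    """
    rows = list(zip(y_true, y_pred, group))
    groups = sorted(set(group))
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two groups")

    # Positive-prediction rate per group (demographic parity).
    pos_rate = [_mean(p for _, p, g in rows if g == a) for a in groups]
    # True-positive rate per group (equal opportunity).
    tpr = [_mean(p for y, p, g in rows if g == a and y == 1) for a in groups]

    return abs(pos_rate[0] - pos_rate[1]), abs(tpr[0] - tpr[1])


if __name__ == "__main__":
    # Toy data, made up purely for illustration.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(fairness_gaps(y_true, y_pred, group))
```

On the toy data both groups receive positive predictions at the same rate, so the demographic-parity gap vanishes even though the true-positive rates differ; the two criteria are distinct and can disagree.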

Theorems & Definitions (28)

  • Definition 2.1
  • Definition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Remark 3.4
  • Remark 3.5
  • Theorem 4.1
  • Corollary 4.2
  • Proposition 4.3
  • Proposition 4.4
  • ...and 18 more