Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

Hiba Ahsan, Byron C. Wallace

TL;DR

The paper investigates whether Sparse Autoencoders (SAEs) can reveal and mitigate racial biases in healthcare-oriented LLMs. By probing activations from Gemma-2 models on clinical notes, the authors identify a race-predictive latent (the Black latent) that activates on explicit race mentions and on stigmatizing concepts, and they demonstrate causal steering that alters outputs in ways linked to race. While ablation of race latents and anti-bias prompting reduce bias in toy vignette tasks, the effects are inconsistent and limited in more realistic clinical tasks, suggesting that race representations are entangled with clinical content and not easily removed without harming performance. Overall, SAEs offer a useful diagnostic tool for detecting problematic race associations in clinical LLMs, but their practical utility for robust bias mitigation in real-world tasks remains limited and task-dependent.

Abstract

LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in Gemma-2 models which appear to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also problematic words like "incarceration". We then show that we can use this latent to steer models to generate outputs about Black patients, and further that this can induce problematic associations in model outputs as a result. For example, activating the Black latent increases the risk assigned to the probability that a patient will become "belligerent". We evaluate the degree to which such steering via latents might be useful for mitigating bias. We find that this offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks. Overall, our results suggest that: SAEs may offer a useful tool in clinical applications of LLMs to identify problematic reliance on demographics but mitigating bias via SAE steering appears to be of marginal utility for realistic tasks.
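To make the two interventions named in the abstract concrete, the following is a minimal sketch of ablating and steering a single SAE latent on one residual-stream vector. It is an illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, the encoder uses a plain ReLU (the SAEs used in the paper may use a different activation), and the names `encode`, `decode`, `ablate`, `steer`, `LATENT`, and `alpha` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 8, 32  # toy sizes, chosen only for illustration

# Random stand-in weights (assumption): a real SAE would be trained on model activations.
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map a residual activation to non-negative latent activations (plain ReLU)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def decode(z):
    """Map latent activations back to the residual space."""
    return z @ W_dec + b_dec


def ablate(x, idx):
    """Subtract latent idx's decoded contribution from x.

    This equals decoding with latent idx zeroed and adding back the SAE's
    reconstruction error, so everything the SAE does not explain is preserved.
    """
    z = encode(x)
    return x - z[..., idx:idx + 1] * W_dec[idx]


def steer(x, idx, alpha):
    """Push x along latent idx's decoder direction with strength alpha."""
    return x + alpha * W_dec[idx]


x = rng.normal(size=d_model)   # stand-in for one token's residual activation
LATENT = 3                     # hypothetical index of a latent of interest
x_ablated = ablate(x, LATENT)
x_steered = steer(x, LATENT, alpha=4.0)
```

In a real pipeline the modified vector would be written back into the model's forward pass at the layer the SAE was trained on; that hook is model-specific and is omitted here.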

Paper Structure

This paper contains 32 sections, 5 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Max-activating examples of Black latents in clinical discharge summaries. The latents activate on mentions of Black identity, which is intuitive. But they also reveal problematic associations like activating on cocaine (examples boxed in red).
  • Figure 2: Effect ($E$; Equation \ref{eq:effect}) of ablating race latents. Latent identifiers are on the y-axis (descriptions in Table \ref{tab:gemma-race-latents}). Race latents have a minimal effect on model outputs across tasks and models.
  • Figure 3: $\Delta_{\text{logitdiff}}$ before and after interventions. Explicitly prompting the model not to factor in patient race reduces bias in four out of five tasks, but over-corrects for cocaine abuse. SAE interventions marginally reduce bias in two tasks.
  • Figure 4: Neuronpedia screenshots for the Black latent in gemma-2-2B
  • Figure 5: Neuronpedia screenshots for the Black latent in gemma-2-9B
  • ...and 1 more figure