Breaking Down Bias: On The Limits of Generalizable Pruning Strategies

Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko

TL;DR

This work investigates whether pruning can mitigate racial bias in large language models and whether a single generalizable strategy is feasible. Using Llama-3-8B-Instruct, the authors localize bias-driving components via neuron and attention-head scoring and apply targeted pruning to reduce disparities between Black- and White-associated prompts. They find neuron pruning to be more effective than head pruning, but generalization across domains deteriorates as context diverges, suggesting bias is partly domain-specific and that deployer-controlled, use-case-specific mitigation may be necessary. The findings have regulatory relevance, supporting use-case-specific monitoring and liability for deployers under contemporary AI governance frameworks, rather than relying on a universal upstream fix.

Abstract

We employ model pruning to examine how LLMs conceptualize racial biases, and whether a generalizable mitigation strategy for such biases appears feasible. Our analysis yields several novel insights. We find that pruning can be an effective method to reduce bias without significantly increasing anomalous model behavior. Neuron-based pruning strategies generally yield better results than approaches pruning entire attention heads. However, our results also show that the effectiveness of either approach quickly deteriorates as pruning strategies become more generalized. For instance, a model that is trained to remove racial biases in the context of financial decision-making generalizes poorly to biases in commercial transactions. Overall, our analysis suggests that racial biases are only partially represented as a general concept within language models. The other part of these biases is highly context-specific, suggesting that generalizable mitigation strategies may be of limited effectiveness. Our findings have important implications for legal frameworks surrounding AI. In particular, they suggest that an effective mitigation strategy should include the allocation of legal responsibility to those who deploy models in a specific use case.

Paper Structure

This paper contains 29 sections, 13 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Illustration of our pruning-based bias mitigation method. Initially, the unpruned model (red) exhibits disparities in responses to prompts associated with different racial groups. For example, in the Purchase scenario, the model suggests significantly different price estimates when prompted with a White-associated name (Hunter Becker) versus a Black-associated name (Jamal Washington). To address this, we localize the model components that are most influential for the majority (green) and minority group (blue) prompts. Components uniquely influential to the minority group are identified and pruned (i.e., zeroed out), with the goal of reducing bias. The pruned model (red) demonstrates similar responses across groups, as shown in the final price distributions.
  • Figure 2: Impact of Neuron and Attention Head Pruning on Bias and Utility. The top panels present the Standardized Mean Difference (SMD) scores across ten variations of the Purchase scenario, comparing the unpruned baseline (green) with three pruning approaches: Prompt-Specific (orange), Within-Context (blue), and Cross-Context (brown). Vertical dashed lines indicate the mean SMD for each approach. The bottom panels illustrate the inlier ratio across all variations and pruning methods, measuring the model's ability to generate reasonable outputs post-pruning.
  • Figure 3: Overlap in biased neurons between Purchase variations and variations from other scenarios. Heat is defined as a fraction: the numerator is the number of pruned neurons shared between a given scenario's variation and a given Purchase variation, and the denominator is the total number of pruned neurons for that scenario's variation. Higher values indicate stronger overlap.
  • Figure 4: Neuron Pruning Distribution across Layers and Subcomponents. This heatmap illustrates the distribution of pruned neurons across different layers (0-31) and network subcomponents (q, k, v, gate, up, down). The color intensity represents the proportion of pruned neurons relative to the total count in each location, with warmer colors indicating more pruning. The blue line shows the contribution of each layer to the total number of pruned neurons.
  • Figure 5: Grid search results for neuron pruning. The left plot shows SMD and the right plot shows the inlier ratio.
  • ...and 4 more figures
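
The quantities named in the captions for Figures 1 to 3 can be illustrated with a small sketch. It is not the authors' implementation. The SMD function assumes the common pooled-standard-deviation form, since the paper's exact definition is not shown in this section. The function names, the toy price lists, and the use of (layer, index) tuples as neuron identifiers are all hypothetical. The overlap fraction follows the Figure 3 caption (shared pruned neurons divided by the scenario variation's pruned-neuron count), and the set difference follows the Figure 1 description of components uniquely influential to the minority group.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """SMD in an assumed pooled-SD form: (mean_a - mean_b) / pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def unique_to_minority(minority_top, majority_top):
    """Components influential for minority-group prompts but not majority-group ones."""
    return set(minority_top) - set(majority_top)


def overlap_fraction(scenario_pruned, purchase_pruned):
    """Shared pruned neurons divided by the scenario variation's pruned count."""
    scenario_pruned = set(scenario_pruned)
    if not scenario_pruned:
        return 0.0  # convention chosen here for an empty pruned set
    return len(scenario_pruned & set(purchase_pruned)) / len(scenario_pruned)


# Toy numbers for illustration only; these are not results from the paper.
white_prices = [10.0, 12.0, 14.0]
black_prices = [8.0, 10.0, 12.0]
smd = standardized_mean_difference(white_prices, black_prices)

# Hypothetical neuron identifiers as (layer, index) pairs.
to_prune = unique_to_minority({(0, 1), (0, 2), (3, 5)}, {(0, 2)})
overlap = overlap_fraction({1, 2, 3, 4}, {3, 4, 5})
```

An SMD near zero would indicate similar output distributions for the two groups. Because the overlap fraction is normalized by the scenario variation's own pruned set, it is not symmetric between the two variations being compared.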