Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs

Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke

Abstract

We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, such as age, gender, or race, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and improvements over fine-tuning in 12 out of 17 evaluations. In addition, steering vectors showed the lowest impact on MMLU scores of the four bias mitigation methods tested. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that they are a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.

Paper Structure

This paper contains 28 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: An overview of our experimental setup: we train a steering vector on 300 data points for each of 8 bias axes, and identify the layer with the highest level of linear separability and the best coefficient on a validation set.
  • Figure 2: Two-component PCA plots of the BBQ validation set for the age, appearance and nationality steering vectors at layers 7 and 13, with the linear separability accuracy, as determined by a Logistic Regression classifier, noted at the top. The yellow and blue points correspond to the final tokens of the positive and negative prompts.
  • Figure 3: Accuracy on the BBQ validation set (blue) and the accuracy of the Logistic Regression classifier which measures linear separability (grey), for the age steering vector.
  • Figure 4: The average accuracy across eight steering vectors on the BBQ Validation Set vs an MMLU Validation Set across different coefficients.
  • Figure 5: Two-component PCA plots over all hidden layers for the nationality vector, with the logistic regression classifier accuracy, demonstrating the linear separability at each layer.
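
The captions above describe a pipeline of training a steering vector, scoring each layer by linear separability with a logistic regression classifier, and applying the vector with a tuned coefficient. A minimal sketch of that idea follows, under several assumptions that do not come from the paper: the steering vector is taken as the mean difference between positive and negative activations (one common construction), the activations are synthetic rather than extracted from a model, the logistic regression is a small NumPy gradient-descent version, and accuracy is measured on the same points the classifier was fit on rather than on a held-out validation set. All function names are hypothetical.

```python
import numpy as np

def mean_difference_vector(pos, neg):
    # Assumed construction: mean of positive minus mean of negative activations.
    return pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(hidden, vector, coefficient):
    # Shift a hidden activation along the steering direction.
    return hidden + coefficient * vector

def separability_accuracy(pos, neg, steps=500, lr=0.1):
    # Fit a logistic regression by gradient descent and return its accuracy
    # on the same data (a simplification; a held-out split would be stricter).
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    pred = (X @ w + b) > 0
    return float((pred == y.astype(bool)).mean())

def best_layer(acts_by_layer):
    # Score every layer and return the most linearly separable one.
    scores = {layer: separability_accuracy(pos, neg)
              for layer, (pos, neg) in acts_by_layer.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-in for final-token activations at two layers.
rng = np.random.default_rng(0)
dim, n = 16, 150
acts = {
    7: (rng.normal(size=(n, dim)), rng.normal(size=(n, dim))),         # overlapping classes
    13: (rng.normal(size=(n, dim)) + 3.0, rng.normal(size=(n, dim))),  # shifted classes
}

best, scores = best_layer(acts)
vector = mean_difference_vector(*acts[best])
steered = apply_steering(np.zeros(dim), vector, coefficient=2.0)
print(best, scores)
```

In this toy setup the layer keys 7 and 13 only echo the layers shown in Figure 2; the coefficient of 2.0 is arbitrary, whereas the paper selects the coefficient on a validation set.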