Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

TL;DR

This work identifies silenced biases, unfair model preferences that safety alignment conceals, and introduces the Silenced Bias Benchmark (SBB) to reveal them. SBB uses refusal activation steering to bypass the model's refusals and expose latent demographic biases across multiple open-source LLM families, without introducing new biases. The study shows that silenced biases persist in model activations, that jailbreaks fail to uncover them reliably, and that refusal steering does expose them while yielding richer fairness diagnostics via $DPD$ and $KL$. The findings highlight a critical blind spot in current fairness evaluations, and the proposed framework scales bias auditing beyond surface-level outputs, with implications for debiasing and safer AI deployment.
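The two diagnostics named above can be illustrated with a small sketch. This section does not define them, so the code assumes that $DPD$ is a demographic-parity-style gap (the largest difference in selection rate between any two groups) and that $KL$ is the Kullback-Leibler divergence of the model's selection distribution from a uniform one. The function names and the toy answers are hypothetical and are not taken from the paper.

```python
import numpy as np

def selection_distribution(choices, groups):
    """Fraction of answers that picked each demographic group."""
    counts = np.array([sum(c == g for c in choices) for g in groups], dtype=float)
    if counts.sum() == 0:
        raise ValueError("no answer selected any of the listed groups")
    return counts / counts.sum()

def dpd(p):
    """Assumed definition: gap between the most and least selected group."""
    return float(p.max() - p.min())

def kl_from_uniform(p, eps=1e-12):
    """Assumed definition: KL(p || uniform), in nats; 0 means no preference."""
    u = np.full_like(p, 1.0 / len(p))
    p = np.clip(p, eps, 1.0)  # avoid log(0) for groups that were never chosen
    return float(np.sum(p * np.log(p / u)))

# Toy data: four model answers to the same question, three candidate groups.
groups = ["A", "B", "C"]
answers = ["A", "A", "B", "C"]
p = selection_distribution(answers, groups)
print(p, dpd(p), kl_from_uniform(p))
```

Under these assumed definitions, a model that refuses every question contributes no selections at all, which is why refusals have to be reduced before either score says anything about the underlying preference.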

Abstract

Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating model fairness is a complex challenge, and existing approaches typically rely on standard question-answer (QA) style schemes. Such methods often overlook deeper issues because they interpret the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases: unfair preferences that are encoded within models' latent space and effectively concealed by safety alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which scale poorly and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, providing a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach on multiple LLMs; our findings expose an alarming gap between models' direct responses and their underlying fairness issues.
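As a rough illustration of what activation steering to reduce refusals can look like, the sketch below uses one common recipe: estimate a "refusal direction" as the difference between mean activations on refused and answered prompts, then project that direction out of the activations. This section does not specify the paper's actual procedure, so the recipe, the function names, and the random arrays standing in for real model activations are all assumptions.

```python
import numpy as np

def refusal_direction(refused_acts, complied_acts):
    """Unit vector from the mean 'complied' activation to the mean 'refused' one.

    Both inputs are (n_prompts, hidden_dim) arrays captured at a single layer.
    """
    d = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, direction):
    """Remove each activation's component along `direction`: x - (x . d) d."""
    return acts - np.outer(acts @ direction, direction)

# Toy stand-in for real activations: refusals are shifted along the first axis.
rng = np.random.default_rng(0)
complied = rng.normal(size=(8, 4))
refused = complied + np.array([3.0, 0.0, 0.0, 0.0])

d = refusal_direction(refused, complied)
steered = ablate_direction(refused, d)
print(np.abs(steered @ d).max())  # close to zero: the refusal component is gone
```

In a real model the ablation would be applied inside the forward pass (for example through a hook on the chosen layer), so that the model answers the QA item instead of refusing and its choice can then be scored.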

Paper Structure

This paper contains 65 sections, 4 equations, 26 figures, 6 tables.

Figures (26)

  • Figure 1: Refusal activation steering on the SBB dataset, on Llama-2-7b-chat-hf.
  • Figure 2: Cosine similarity with refusal direction across baseline benchmarks compared to SBB, on Llama-2-7b-chat-hf.
  • Figure 3: PCA of biased and unbiased query-response pairs for questions about abilities, on Llama-2-7b-chat-hf, layer 31.
  • Figure 4: BBQ bias scores on Llama-3.1-8B-Instruct, with vs without refusal steering.
  • Figure 5: Body type preferences for negative subjects on Qwen-14B.
  • ...and 21 more figures