Table of Contents
Fetching ...

BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

Isack Lee, Haebin Seong

TL;DR

The paper investigates how safety-alignment biases in large language models can inadvertently facilitate jailbreak attacks, introducing BiasJailbreak to study differential susceptibility across demographic keywords. It generates privileged and marginalized keyword prompts via the target LLM and uses harmful prompts to quantify jailbreak success, proposing BiasDefense—a prompt-based defense that avoids additional inference or guard models. Empirical results show measurable disparities in jailbreak success between groups (e.g., a 20% vs 16% difference across keywords as reported in the abstract) and demonstrate that BiasJailbreak can enhance existing jailbreak methods while BiasDefense reduces risk and narrows group gaps. The work offers practical, scalable defenses and open-source artifacts to advance safe and fair LLM deployment, highlighting the need for inclusive alignment strategies that minimize safety-induced biases while maintaining performance.

Abstract

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

TL;DR

The paper investigates how safety-alignment biases in large language models can inadvertently facilitate jailbreak attacks, introducing BiasJailbreak to study differential susceptibility across demographic keywords. It generates privileged and marginalized keyword prompts via the target LLM and uses harmful prompts to quantify jailbreak success, proposing BiasDefense—a prompt-based defense that avoids additional inference or guard models. Empirical results show measurable disparities in jailbreak success between groups (e.g., a 20% vs 16% difference across keywords as reported in the abstract) and demonstrate that BiasJailbreak can enhance existing jailbreak methods while BiasDefense reduces risk and narrows group gaps. The work offers practical, scalable defenses and open-source artifacts to advance safe and fair LLM deployment, highlighting the need for inclusive alignment strategies that minimize safety-induced biases while maintaining performance.

Abstract

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

Paper Structure

This paper contains 30 sections, 5 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: BiasJailbreak reveals inherent biases in LLMs that disproportionately allow harmful jailbreak attacks to succeed more frequently when directed towards marginalized groups compared to privileged groups.
  • Figure 2: Illustration showcasing the difference in response between a standard prompt and a BiasJailbreak prompt. While the standard prompt is blocked by the LLM's safety features, the BiasJailbreak prompt exploits ethical biases to elicit a response.
  • Figure 3: Overview of the BiasJailbreak methodology. The same harmful prompt is used across different keywords representing contrasting groups to analyze variations in jailbreak success rates.*All keywords representing both privileged and underprivileged groups are generated by the LLM.
  • Figure 4: BiasDefense adjusts inherent biases in LLMs that are exploited by BiasJailbreak. It is efficient since it does not require additional inference or models such as Guard Models.
  • Figure 5: PCA of Phi-mini model, which has a 0.4386 BiasJailbreak success rate. BiasJailbreak samples are closely clustered with Benign samples.
  • ...and 2 more figures