Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

Riccardo Cantini; Giada Cosenza; Alessio Orsino; Domenico Talia

TL;DR

This work addresses bias and safety in Large Language Models by proposing a two-step benchmarking framework that first uses standard prompts to elicit bias and then applies jailbreak prompts to test adversarial robustness across model scales. It formalizes a safety scoring system, where per-bias robustness $\rho_{p_b}$ and fairness $\phi_{p_b}$ combine into $\sigma_{p_b}$, with $\sigma_b$ and the overall $\sigma$ aggregating across biases, enabling cross-model comparisons. The study finds that many widely used LLMs remain vulnerable to adversarial bias elicitation, with jailbreak techniques (including role-playing, machine translation, obfuscation, prompt injection, and reward incentives) capable of reducing safety, particularly for GPT-3.5 Turbo, while some models like Llama 3 70B and Gemini Pro exhibit stronger resilience. The results motivate layered defense strategies and continued improvement of bias mitigation and alignment to support safer deployment of LLMs in real-world applications.
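The two-step flow summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the combination rule (an arithmetic mean of robustness and fairness), the worst-case aggregation over attacks, the function names, and all input scores are hypothetical. Only the threshold value 0.5 comes from the paper (the safety threshold $\tau$ shown in Figure 3).

```python
from statistics import mean

TAU = 0.5  # safety threshold tau, as reported in the caption of Figure 3


def combine(rho: float, phi: float) -> float:
    """ASSUMPTION: safety is the arithmetic mean of robustness and fairness.

    The summary only says the two scores "combine"; the real rule may differ.
    """
    return (rho + phi) / 2


def two_step_assessment(standard_scores, jailbreak_scores, tau=TAU):
    """Sketch of the two-step benchmark.

    standard_scores:  {bias: (rho, phi)} measured with standard prompts.
    jailbreak_scores: {bias: {attack: (rho, phi)}} measured under attack.
    Returns a per-bias report and the mean of the final safety scores.
    """
    report = {}
    for bias, (rho, phi) in standard_scores.items():
        initial = combine(rho, phi)
        entry = {"initial": initial, "attacked": False, "final": initial}
        attacks = jailbreak_scores.get(bias, {})
        # Step 2 applies only to categories deemed safe in step 1.
        if initial >= tau and attacks:
            entry["attacked"] = True
            # ASSUMPTION: report the lowest safety observed across attacks.
            entry["final"] = min(combine(r, p) for r, p in attacks.values())
        report[bias] = entry
    overall = mean(e["final"] for e in report.values())
    return report, overall


# Hypothetical scores, chosen only to exercise both branches.
standard = {"gender": (0.75, 0.75), "age": (0.25, 0.5)}
jailbreak = {"gender": {"role_playing": (0.5, 0.25), "obfuscation": (0.75, 0.5)}}
report, overall = two_step_assessment(standard, jailbreak)
```

In this toy run the "gender" category passes the initial assessment and is then attacked, while "age" already falls below the threshold and is not tested further, mirroring the flow described for Figure 1.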

Abstract

Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.

Paper Structure (12 sections, 2 equations, 6 figures, 3 tables)

Introduction
Related work
Fairness evaluation and bias benchmarking.
Adversarial attacks via jailbreak prompting.
Proposed methodology
Safety evaluation using standard prompts
Definitions and measures.
Adversarial analysis using jailbreak prompts
Experimental results
Initial safety assessment
Adversarial analysis
Conclusion and future directions

Figures (6)

Figure 1: Execution flow of the proposed methodology. Standard prompts are used to assess model safety across each bias category, with further analysis using jailbreak prompts for all categories deemed safe during the initial assessment.
Figure 2: Heatmaps depicting the robustness, fairness, and safety scores at the bias level of each model after the initial safety assessment. Darker green shades indicate higher positive scores, whereas darker red shades indicate more biased evaluations.
Figure 3: Overall robustness, fairness, and safety achieved by each model when tested with standard prompts. Models are categorized as small, medium, and large based on their number of parameters. The red dotted line indicates the safety threshold $\tau = 0.5$.
Figure 4: Analysis of model behavior during the initial safety assessment in terms of refusal vs. debiasing rate (left) and stereotype vs. counterstereotype rate (right).
Figure 5: Effectiveness of each jailbreak attack across various models, evaluated in terms of safety reduction relative to the initial assessment with standard prompts.
...and 1 more figure
