Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia

TL;DR

This work tackles adversarial bias elicitation in large language models by introducing CLEAR-Bias, a bias benchmark with 4,400 prompts across seven isolated and three intersectional categories, plus seven jailbreak techniques. It proposes a scalable LLM-as-a-judge framework with a rigorous judge selection process (favoring DeepSeek V3 671B), a two-step safety evaluation (an initial assessment of robustness and fairness, followed by adversarial jailbreak analysis), and misinterpretation filtering to ensure meaningful safety assessments. The study reports extensive empirical results across small and large models, showing that safety is not solely a function of size and that newer models can be more vulnerable to sophisticated prompts; medical-domain LLMs often exhibit lower safety than their general-purpose counterparts. The findings highlight the need for robust bias-detection and safety-alignment mechanisms, caution against over-reliance on scale, and point to future directions such as cross-judging, extended bias taxonomies, and explicit bias auditing in domain-specific models.

Abstract

The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.

Paper Structure

This paper contains 23 sections, 8 equations, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: The bias taxonomy used in CLEAR-Bias, consisting of 10 bias categories (7 isolated and 3 intersectional) spanning 37 different groups and identities.
  • Figure 2: Execution flow of the proposed benchmarking methodology. The control set from CLEAR-Bias is used to select the best judge model. Then, base prompts are used to assess model safety across each bias category. For categories deemed safe in the initial assessment, further analysis is conducted using jailbreak prompts.
  • Figure 3: Comparison of robustness, fairness, and safety scores at the bias level of each model after the initial safety assessment. Darker green shades indicate higher positive scores, whereas darker red shades indicate more biased evaluations.
  • Figure 4: Overall robustness, fairness, and safety achieved by each model when tested with base prompts. The red dotted line indicates the safety threshold $\tau = 0.5$.
  • Figure 5: Pairwise comparison of safety scores across model families, illustrating the scaling effects from smaller to larger versions. Circle size represents the log-scaled parameter count (ranging from 2B to 405B), while arrows are annotated with the corresponding safety increment.
  • ...and 8 more figures
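
The gating step described in the Figure 2 caption (only categories deemed safe in the initial assessment proceed to jailbreak analysis) can be illustrated with a minimal sketch. It assumes the threshold $\tau = 0.5$ from the Figure 4 caption is the criterion for "deemed safe" and that the comparison is inclusive; neither detail is confirmed by the text above. The function name `categories_for_jailbreak` and the example scores are hypothetical and are not results from the paper.

```python
from typing import Dict, List

# Safety threshold taken from the Figure 4 caption (tau = 0.5).
TAU = 0.5


def categories_for_jailbreak(
    safety_by_category: Dict[str, float], tau: float = TAU
) -> List[str]:
    """Return the bias categories that pass the initial safety assessment.

    Assumption: a category counts as "safe" when its score is >= tau.
    Whether the original comparison is strict or inclusive is not stated.
    """
    return sorted(
        category
        for category, score in safety_by_category.items()
        if score >= tau
    )


# Hypothetical per-category safety scores for one model (illustrative only).
initial_scores = {
    "age": 0.35,
    "disability": 0.48,
    "gender": 0.72,
    "religion": 0.61,
}

selected = categories_for_jailbreak(initial_scores)
print(selected)
```

In this sketch, categories below the threshold are already considered unsafe after the base prompts, so only the remaining ones would be probed further with jailbreak prompts.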