Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Riccardo Cantini; Alessio Orsino; Massimo Ruggiero; Domenico Talia

TL;DR

This work addresses adversarial bias elicitation in large language models by introducing CLEAR-Bias, a bias benchmark of 4,400 prompts spanning seven isolated and three intersectional categories, combined with seven jailbreak techniques. It proposes a scalable LLM-as-a-judge framework with a rigorous judge selection process (which favors DeepSeek V3 671B), a two-step safety evaluation (robustness and fairness first, then adversarial jailbreak analysis), and misinterpretation filtering to keep safety assessments meaningful. Extensive empirical results across small and large models show that safety is not solely a function of size and that newer models can be more vulnerable to sophisticated prompts; medical-domain LLMs often exhibit lower safety than their general-purpose counterparts. The findings highlight the need for robust bias-detection and safety-alignment mechanisms, caution against over-reliance on scale, and point to future directions such as cross-judging, extended bias taxonomies, and explicit bias auditing in domain-specific models.

Abstract

The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.
