Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh
TL;DR
The paper introduces a toxicity rabbit hole framework that stress-tests large language model guardrails by iteratively eliciting toxic content across 1,266 identity groups, revealing significant safety gaps in PaLM 2 and other models. The two-part study first analyzes PaLM 2 guardrails and then generalizes the framework to multiple LLMs, uncovering antisemitism, racism, misogyny, Islamophobia, homophobia, ableism, and transphobia, with Holocaust misrepresentation prevalent across several models. The work combines a rigorous methodological pipeline (stages, prompts, hyperparameters, and embedding analyses) with large-scale data (RabbitHole: 1,344,391 responses from 10 LLMs) to quantify safety risks, identify recurring patterns such as necessity modals and dehumanizing language, and examine calibration against Perspective API scores. It discusses the broader implications for safety, transparency, and policy, emphasizing that guardrails can be brittle under adversarial prompting and that toxicity may be repurposed through computational creativity, and it highlights the need for robust, reproducible safety evaluations of modern, API-accessible LLMs.
Abstract
This paper makes three contributions. First, we present a novel, generalizable framework, dubbed the toxicity rabbit hole, that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of PaLM 2 guardrails and present key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.
