Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models

Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh

TL;DR

The paper introduces the toxicity rabbit hole framework, which stress-tests large language model guardrails by iteratively eliciting toxic content across 1,266 identity groups, and reveals significant safety gaps in PaLM 2 and other LLMs. The two-part study first analyzes PaLM 2 guardrails and then generalizes the framework to multiple LLMs, uncovering antisemitism, racism, misogyny, Islamophobia, homophobia, ableism, and transphobia, with Holocaust misrepresentation prevalent across several models. The work combines a methodological pipeline (stages, prompts, hyperparameters, and embedding analyses) with large-scale data (RabbitHole: 1,344,391 responses from 10 LLMs) to quantify safety risks, identify recurring patterns such as necessity modals and dehumanizing language, and examine calibration against Perspective API scores. It also discusses the broader implications for safety, transparency, and policy: guardrails can be brittle under adversarial prompting, and toxicity may be repurposed through computational creativity, which underscores the need for robust, reproducible safety evaluations of modern, API-accessible LLMs.

Abstract

This paper makes three contributions. First, it presents a generalizable, novel framework dubbed toxicity rabbit hole that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of PaLM 2 guardrails, presenting key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.

Paper Structure (33 sections, 12 figures, 46 tables)

This paper contains 33 sections, 12 figures, 46 tables.

Figures (12)

  • Figure 1: Toxicity rabbit hole schematic diagram.
  • Figure 2: Distribution of rabbit hole depths. One subfigure presents rabbit hole depths for three broad categories of identity groups. For religious identity groups, PaLM 2 exhibits the shallowest rabbit hole depth (3.53), followed by national (5.01) and ethnic (5.21) identity groups. In fact, PaLM 2 blocks 23.8% of the very first toxic expansion requests for religious identity groups. In contrast, it blocks 10.75% and 5.41% of the toxic expansion requests for national identity groups and ethnic identity groups, respectively. A large fraction of the ethnic groups are ethnic minorities.
  • Figure 3: One panel shows that several historically disadvantaged groups are among the groups that PaLM 2 rabbit hole expansions target. A second panel presents the frequently used verbs and adjectives present in the right context followed by a necessity modal; several of these words indicate calls for physical violence and ethnic cleansing. Finally, a third panel summarizes the overall safety evaluation of rabbit hole expansions along six safety dimensions, showing that nearly 80% of the rabbit hole expansions are not evaluated as dangerous by PaLM 2.
  • Figure 4: Embedding similarity of rabbit hole expansions across different LLMs.
  • Figure 5: Frequently used verbs and adjectives present in the right context followed by a necessity modal.
  • ...and 7 more figures
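
The per-category numbers in the Figure 2 caption (a depth per category and the share of first expansion requests that are blocked) can be illustrated with a small aggregation sketch. This is not the authors' code. It assumes that the reported depths are per-category averages, that each audit run is recorded as a hypothetical (category, depth) pair, and that a depth of 0 means the very first expansion request was blocked. The function name and the toy records are made up for illustration and are not data from the paper.

```python
from collections import defaultdict
from statistics import mean


def summarize_depths(records):
    """Aggregate hypothetical audit records by identity-group category.

    records: iterable of (category, depth) pairs, where depth is assumed to be
    the number of expansions produced before the model refused
    (0 = the very first expansion request was blocked).
    """
    by_category = defaultdict(list)
    for category, depth in records:
        by_category[category].append(depth)

    summary = {}
    for category, depths in by_category.items():
        summary[category] = {
            # Average number of expansions before a refusal.
            "mean_depth": mean(depths),
            # Fraction of runs blocked at the first expansion request.
            "first_request_block_rate": sum(d == 0 for d in depths) / len(depths),
        }
    return summary


# Toy records for illustration only; these are NOT values from the paper.
toy_records = [
    ("religious", 0), ("religious", 4),
    ("national", 5), ("national", 7),
    ("ethnic", 6), ("ethnic", 4),
]
summary = summarize_depths(toy_records)
```

Under these assumptions, a category with a lower mean depth and a higher first-request block rate corresponds to stricter guardrails, which is the pattern the caption reports for religious identity groups.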