STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions

Robert Morabito; Sangmitra Madhusudan; Tyler McDonald; Ali Emami

TL;DR

This paper introduces STOP, a progression-based bias evaluation dataset designed to capture how bias escalates from implicit to explicit. It comprises 450 offensive progressions totaling 2,700 sentences and covers 9 demographics and 46 sub-demographics. STOP formalizes the assessment as five-sentence scenarios with counterfactuals, and defines both idealistic and realistic performance measures to compare model outputs against human judgments, including Hedges' g for alignment. The authors evaluate a diverse set of closed- and open-source large language models, finding substantial variability in bias sensitivity and noting that human judgments align differently across severity levels. They also show that fine-tuning a model on human responses to STOP can raise answer rates on the downstream bias tasks BBQ, StereoSet, and CrowS-Pairs by up to 191% while maintaining or even improving performance. Overall, STOP provides a novel framework for benchmarking and guiding bias mitigation in LLMs, highlighting the importance of human alignment for practical, fair deployment and outlining ethical considerations and future research directions.
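As a rough illustration of the alignment measure mentioned above, the sketch below computes Hedges' g, the standardized mean difference with a small-sample bias correction, between two sets of sensitivity scores. It assumes two independent samples and uses the common approximation for the correction factor. The function name `hedges_g` and the example scores are hypothetical, and the paper's exact procedure is not described in this summary.

```python
import math
from statistics import mean, variance


def hedges_g(model_scores, human_scores):
    """Hedges' g between two independent samples (illustrative sketch).

    Assumption: both inputs are plain sequences of numeric sensitivity
    scores; this is not taken from the paper's own code.
    """
    n1, n2 = len(model_scores), len(human_scores)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two scores")

    # Pooled variance, weighting each sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(model_scores) + (n2 - 1) * variance(human_scores)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; g is undefined")

    # Cohen's d: mean difference in units of the pooled standard deviation.
    d = (mean(model_scores) - mean(human_scores)) / math.sqrt(pooled_var)

    # Approximate small-sample correction factor J = 1 - 3 / (4(n1 + n2) - 9).
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# Hypothetical scores, for illustration only.
example_g = hedges_g([0.2, 0.4, 0.6], [0.4, 0.6, 0.8])
```

A value of g near zero would indicate that the model's scores are close to the human scores on average; the sign shows which group scores higher.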

Abstract

Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitate the creation of fairer language models.

Paper Structure (42 sections, 23 figures, 18 tables)

This paper contains 42 sections, 23 figures, and 18 tables.

Introduction
Sensitivity Testing on Offensive Progressions (STOP)
Formalization
Task Construction
Task Composition
Severity Level:
Demographics:
Sub-demographics:
Task Evaluation
Idealistic Performance
Realistic Performance
Experiments
Evaluating LLM Sensitivity:
Evaluating Human Sensitivity:
Models:
...and 27 more sections

Figures (23)

Figure 1: Task construction process from conception to testing, with instance counts at each stage
Figure 2: The variance in bias sensitivity by each model across different Religions
Figure 3: Average bias sensitivity scores of Llama 2-70b, Llama 2-7b, and Gemma on moderate severity progressions. The dotted ring is the ideal score, 0.8.
Figure 4: Box plot showcasing the spread of sensitivity scores for each model across severity levels.
Figure 5: Average bias sensitivity scores between Llama 2-70b, Llama 3-70b, and Gemma on moderate severity progressions. The dotted ring represents the human scores.
...and 18 more figures
