LLM-Safety Evaluations Lack Robustness

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann

TL;DR

The paper argues that LLM safety evaluations suffer from noise and bias across datasets, algorithms, and evaluation pipelines, which hinders reproducibility and fair comparison. Through case studies, it analyzes the three stages of the evaluation pipeline (datasets, attack/defense algorithms, and evaluation judgements) and identifies the core sources of noise and bias at each. It offers concrete guidelines to improve reliability and comparability: larger high-quality benchmarks, no subsampling, standardized implementation details, fair budgeted comparisons, and multiple judges combined with human verification. It also presents an opposing perspective that highlights practical reasons to value smaller datasets and human-in-the-loop approaches, while advocating distributional evaluation to better reflect real-world use. Overall, the work aims to accelerate trustworthy progress in LLM safety by improving evaluation reliability, transparency, and cross-study comparability.

Abstract

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
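To make the small-dataset concern concrete, the following is a minimal sketch of how sampling noise in an attack success rate (ASR) shrinks with the number of prompts. It assumes prompts are independent and uses the textbook binomial standard error with a normal-approximation interval; the function names and the example ASR of 0.5 are illustrative assumptions, not values or methods taken from the paper.

```python
import math


def asr_standard_error(asr: float, n_prompts: int) -> float:
    """Binomial standard error of an ASR estimated from n_prompts independent prompts."""
    return math.sqrt(asr * (1.0 - asr) / n_prompts)


def normal_interval(asr: float, n_prompts: int, z: float = 1.96) -> tuple:
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    se = asr_standard_error(asr, n_prompts)
    return max(0.0, asr - z * se), min(1.0, asr + z * se)


if __name__ == "__main__":
    # Hypothetical ASR of 0.5 measured on a small and a larger benchmark.
    for n in (50, 500):
        low, high = normal_interval(0.5, n)
        print(f"n={n}: ASR 0.50, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval narrows with the square root of the dataset size, so two methods whose ASRs differ by a few points are hard to separate on a benchmark of only a few dozen prompts.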

Paper Structure

This paper contains 32 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: We decompose the LLM safety evaluation pipeline into three stages and analyze common problems at each step.
  • Figure 2: Impact of the SentencePiece white space token on GCG performance against Llama-2-7b-chat-hf. We judge the generated prompt at each step and report "many-trial" or "cumulative" ASR.
  • Figure 3: Comparing ASR differences across common judges reveals significant model-specific bias, despite all four models being derivatives of the same Mistral 7B model.
  • Figure 4: Impact of white-space tokens on GCG attack progress vs. Llama-2-7b-chat-hf.
  • Figure 5: Impact of implementation details on GCG attack progress vs. Llama-3.1-8B-Instruct.
  • ...and 3 more figures
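
The "many-trial" or "cumulative" ASR mentioned in the Figure 2 caption can be sketched as follows. This is one plausible reading, stated as an assumption: a prompt counts as broken at step t if the judge flagged its response as harmful at any step up to and including t. The `success` matrix and the function name are hypothetical and not taken from the paper's code.

```python
from typing import List


def cumulative_asr(success: List[List[bool]]) -> List[float]:
    """Cumulative ASR per attack step.

    success[i][t] is True if the judge flagged the response to prompt i
    at optimization step t as harmful (hypothetical input layout).
    """
    n_prompts = len(success)
    n_steps = len(success[0])
    broken = [False] * n_prompts  # has prompt i succeeded at any step so far?
    curve = []
    for t in range(n_steps):
        for i in range(n_prompts):
            broken[i] = broken[i] or success[i][t]
        curve.append(sum(broken) / n_prompts)
    return curve


if __name__ == "__main__":
    judged = [
        [False, True, False],   # prompt 0 succeeds at step 1 only
        [False, False, False],  # prompt 1 never succeeds
    ]
    print(cumulative_asr(judged))
```

Under this reading the curve never decreases, so it rewards attacks that are judged at many steps; that is one reason comparisons need a fixed query or step budget.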