Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks

TL;DR

This paper presents a comprehensive empirical meta-analysis of AI safety benchmarks, testing whether common safety metrics measure safety itself or merely track upstream model capabilities and training compute. The authors construct a per-model capabilities score from the first principal component of a set of capabilities benchmarks, then compute Spearman correlations between that score and a broad suite of safety benchmarks across language and vision models. Many safety metrics turn out to track capability progress, which creates a widespread risk of safetywashing. Benchmarks in some subfields (alignment, scalable oversight, truthfulness, and static adversarial robustness) often correlate with capabilities, others (bias, some calibration measures) correlate more weakly, and weaponization tends to diverge. The authors argue for an empirical foundation for safety metrics, recommend reporting each benchmark's correlation with capabilities, and advocate developing decorrelated benchmarks so that reported safety progress reflects more than scaling. Overall, the work calls for a shift from intuition-driven alignment discourse to data-driven evaluation frameworks that isolate true safety progress from latent capability improvements.

Abstract

As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with both upstream model capabilities and training compute, potentially enabling "safetywashing"--where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.

Paper Structure

This paper contains 78 sections, 4 equations, 17 figures, and 13 tables.

Figures (17)

  • Figure 1: Across various safety areas, we investigate whether common safety benchmarks are correlated with capabilities and compute used, ultimately obscuring differential safety progress.
  • Figure 2: The tight connection between many safety properties and capabilities can enable safetywashing, where capabilities advancements (e.g., training a larger model) can be advertised as progress on "AI safety." This confuses the research community about the developments that have occurred, distorting the academic discourse.
  • Figure 3: Step 1: We produce a matrix of scores for a set of language models evaluated on a set of capabilities and safety benchmarks. Step 2: We extract the first principal component of the capabilities benchmarks and use it to compute a capabilities score for each model. Step 3: We identify whether safety benchmarks have high capabilities correlations using Spearman's correlation.
  • Figure 4: A safety benchmark with a high correlation and high slope with respect to upstream model capabilities and training compute can be used for safetywashing.
  • Figure 5: Alignment benchmarks assess AI systems' ability to produce outputs that humans prefer.
  • ...and 12 more figures
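
The three steps in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' code: the function names, the z-score standardization before PCA, the sign convention for the principal component, and the tie-free rank computation are all assumptions made here for the example.

```python
import numpy as np

def capabilities_score(cap_matrix):
    """Project each model onto the first principal component of the
    capabilities benchmarks. cap_matrix: (n_models, n_benchmarks)."""
    # Assumption: z-score each benchmark so no single scale dominates.
    z = (cap_matrix - cap_matrix.mean(axis=0)) / cap_matrix.std(axis=0)
    # Rows of vt are the principal axes of the standardized matrix.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = vt[0]
    # The sign of a principal component is arbitrary; orient it so that
    # higher benchmark scores give a higher capabilities score.
    if pc1.sum() < 0:
        pc1 = -pc1
    return z @ pc1

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks.
    Assumption: no tied values (true for the continuous data below)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Step 1: a synthetic score matrix (hypothetical models and benchmarks).
rng = np.random.default_rng(0)
n_models = 200
latent = rng.normal(size=n_models)  # hidden "general capability"
cap_matrix = latent[:, None] + 0.3 * rng.normal(size=(n_models, 5))
safety_entangled = latent + 0.3 * rng.normal(size=n_models)
safety_independent = rng.normal(size=n_models)

# Step 2: one capabilities score per model.
cap = capabilities_score(cap_matrix)

# Step 3: capabilities correlation of each safety benchmark.
corr_entangled = spearman(cap, safety_entangled)
corr_independent = spearman(cap, safety_independent)
print(f"entangled benchmark:   {corr_entangled:+.2f}")
print(f"independent benchmark: {corr_independent:+.2f}")
```

In this toy setup the first safety benchmark is driven by the same latent factor as the capabilities benchmarks, so its capabilities correlation comes out high, while the second is independent noise and stays near zero. A benchmark of the first kind is the sort the paper flags as vulnerable to safetywashing.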