Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
TL;DR
This paper presents a comprehensive empirical meta-analysis of AI safety benchmarks, testing whether common safety metrics genuinely measure safety or merely track upstream model capabilities and training compute. The authors construct a capabilities score via PCA and compute its Spearman correlation with a broad suite of safety benchmarks across language and vision models. They find a widespread risk of safetywashing: many safety metrics improve largely because models become more capable. Benchmarks in certain subfields (alignment, scalable oversight, truthfulness, and static adversarial robustness) often correlate with capabilities, while others (bias, some calibration measures) show weaker correlations, and weaponization tends to diverge. The authors argue for an empirical foundation for safety metrics, recommend reporting each benchmark's correlation with capabilities, and advocate developing decorrelated benchmarks so that reported safety progress reflects more than scaling effects. Overall, the work calls for a shift from intuition-driven alignment discourse to data-driven evaluation frameworks that isolate true safety progress from latent capability improvements.
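The method summarized above (a PCA-based capabilities score, then a Spearman correlation against each safety benchmark) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the score matrix is synthetic, the helper names `capabilities_score`, `rankdata`, and `spearman` are hypothetical, and standardizing each benchmark before PCA is an assumption about preprocessing.

```python
import numpy as np


def capabilities_score(scores):
    """First principal component of standardized scores (rows: models, cols: benchmarks)."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Rows of vt are the principal directions of the standardized matrix.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = vt[0]
    # The sign of a principal component is arbitrary; orient it so higher = more capable.
    if pc1.sum() < 0:
        pc1 = -pc1
    return z @ pc1


def rankdata(x):
    """1-based ranks, with tied values given their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])


# Synthetic data (assumption): 8 models with increasing latent ability,
# scored on 3 capabilities benchmarks that all rise with that ability.
ability = np.linspace(0.1, 1.0, 8)
capability_benchmarks = np.column_stack([ability, ability**2, np.sqrt(ability)])
cap = capabilities_score(capability_benchmarks)

# Hypothetical safety benchmark A rises monotonically with ability;
# hypothetical safety benchmark B is largely unrelated to it.
safety_a = 0.3 + 0.5 * ability
safety_b = np.array([0.5, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.4])

rho_a = spearman(cap, safety_a)
rho_b = spearman(cap, safety_b)
print(f"benchmark A vs capabilities: {rho_a:.2f}")
print(f"benchmark B vs capabilities: {rho_b:.2f}")
```

In this toy setup, benchmark A is the kind of metric the paper warns about, since gains on it are indistinguishable from gains in general capability, whereas benchmark B's low correlation suggests it measures something that scaling alone does not deliver.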
Abstract
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with both upstream model capabilities and training compute, potentially enabling "safetywashing"--where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
