Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
TL;DR
The paper addresses the challenge of predicting AI risk by shifting from stated to revealed values. It introduces LitmusValues, an evaluation pipeline built on 16 AI value classes, and AIRiskDilemmas, a corpus of 3,000 contextualized dilemmas designed to elicit value-driven choices. By aggregating the outcomes of a model's action choices into per-value Elo ratings, the authors demonstrate that certain values (notably Privacy, Truthfulness, and Respect) predict both seen risks in AIRiskDilemmas and unseen risks in HarmBench, and that these revealed priorities diverge from the models' stated preferences. The work suggests a scalable, value-based warning system for advanced AI risks, with practical implications for alignment and safety monitoring. It also acknowledges cultural and language limitations and proposes paths for broader validation.
Abstract
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
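
To make the aggregation step concrete, the following is a minimal sketch of how a model's choices across dilemmas could be turned into per-value Elo ratings. It is an illustration under stated assumptions, not the authors' implementation: the pairing rule (every value behind the chosen action "beats" every value behind the rejected action), the K-factor of 32, the starting rating of 1000, and the names expected_score and value_elo are all hypothetical, and the three example choices are invented for demonstration.

```python
from collections import defaultdict
from itertools import product


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def value_elo(choices, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Aggregate dilemma choices into per-value Elo ratings.

    `choices` is an iterable of (chosen_values, rejected_values) pairs, one
    per dilemma. ASSUMPTION: each value supporting the chosen action is
    treated as winning a "battle" against each value supporting the
    rejected action. k and initial are illustrative defaults.
    """
    ratings = defaultdict(lambda: initial)
    for chosen_values, rejected_values in choices:
        for winner, loser in product(chosen_values, rejected_values):
            if winner == loser:
                continue  # a value cannot be ranked against itself
            # The winner gains what the loser gives up, so the total is conserved.
            delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)


# Invented example: three dilemmas and the values behind each action.
choices = [
    (["Care"], ["Truthfulness"]),
    (["Care", "Respect"], ["Privacy"]),
    (["Truthfulness"], ["Privacy"]),
]
ratings = value_elo(choices)
ranking = sorted(ratings, key=ratings.get, reverse=True)
for value in ranking:
    print(f"{value}: {ratings[value]:.1f}")
```

Sequential Elo updates depend on the order in which dilemmas are processed, so a practical pipeline would likely shuffle the battles or average over several orderings before reading off a ranking.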
