Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas

Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger

TL;DR

The paper addresses the challenge of predicting AI risk by shifting from stated to revealed values. It introduces LitmusValues, a 16-class AI value framework, and AIRiskDilemmas, a corpus of 3,000 contextualized dilemmas designed to elicit value-driven choices. By aggregating the models' action choices into per-value Elo ratings, the authors show that certain values (notably Privacy, Truthfulness, and Respect) predict both seen risks in AIRiskDilemmas and unseen risks in HarmBench, and that revealed value priorities diverge from stated preferences. The work suggests a scalable, value-based warning system for advanced AI risks, with practical implications for alignment and safety monitoring. It also acknowledges cultural and language limitations and proposes paths for broader validation.
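The Elo aggregation mentioned above can be illustrated with a minimal sketch. It assumes that each dilemma yields "battles" in which every value supporting the chosen action beats every value supporting the rejected action, and that ratings follow the standard Elo update. The function names, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def battles_from_choice(chosen_values, rejected_values):
    """Pair each value behind the chosen action with each value behind
    the rejected action (hypothetical helper, assumed pairing scheme)."""
    return [(w, l) for w in chosen_values for l in rejected_values]


def elo_ratings(battles, k=32.0, base=1000.0):
    """Apply the standard Elo update to (winner, loser) value pairs.

    k and base are assumed defaults, not values reported by the authors.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected score of the winner under the Elo logistic model.
        expected_win = 1.0 / (
            1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)
        )
        # The winner scored 1, so it gains k * (1 - expected score);
        # the loser gives up the same amount.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


if __name__ == "__main__":
    # One made-up dilemma: the model picks the action backed by Truthfulness
    # over the action backed by Care.
    battles = battles_from_choice(["Truthfulness"], ["Care"])
    for value, rating in sorted(elo_ratings(battles).items()):
        print(f"{value}: {rating:.1f}")
```

Sequential Elo updates depend on the order of the battles, so a careful implementation would likely shuffle or average over several orderings; the sketch omits this for brevity.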

Abstract

Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

Paper Structure

This paper contains 39 sections, 1 equation, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Evaluation Pipeline of LitmusValues using AIRiskDilemmas Dataset
  • Figure 2: 16 Shared Value Classes drawing from Anthropic Claude's Constitution and OpenAI ModelSpec (2025-02-12). The full definition of each value class and the detailed mapping of principles to value classes are in the appendix on the principle-to-value-class mapping.
  • Figure 3: Diverse Scenarios in AIRiskDilemmas across Risky Behaviors (left) and Contexts (right).
  • Figure 4: Stated vs. Revealed Value Preferences by GPT-4o (2024-08-06) and Claude 3.7 Sonnet. Rank 1 is most prioritized and 16 is the least.
  • Figure 5: Revealed Values Prioritization of Models. Rank 1 is most prioritized and 16 least. For Claude 3.7 Sonnet/DeepSeek R1, Low means max of 1K reasoning tokens, Med: 4K and High: 16K.
  • ...and 6 more figures