Generative Value Conflicts Reveal LLM Priorities

Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh, Max Kleiman-Weiner

TL;DR

The paper examines how LLM-based assistants prioritize conflicting human values, a gap in current alignment research. It introduces ConflictScope, an automated pipeline that uses LLMs to generate, filter, and evaluate value-conflict scenarios, using simulated users and LLM judges to elicit value rankings via a Bradley-Terry model. Across three value sets (HHH, Personal-Protective, ModelSpec) and two evaluation formats (multiple-choice and open-ended), it finds that models shift from protective to personal values under open-ended evaluation, and that including a detailed value ordering in the system prompt improves alignment with a target ranking by 14%. The work also shows that ConflictScope generates more challenging scenarios than existing moral decision-making and alignment datasets, offering a foundation for future research on value prioritization in LLM alignment and governance.

Abstract

Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs between values when deployed. In response to the scarcity of value conflict in existing alignment datasets, we introduce ConflictScope, an automatic pipeline to evaluate how LLMs prioritize different values. Given a user-defined value set, ConflictScope automatically generates scenarios in which a language model faces a conflict between two values sampled from the set. It then prompts target models with an LLM-written "user prompt" and evaluates their free-text responses to elicit a ranking over values in the value set. Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended value conflict settings. However, including detailed value orderings in models' system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict. Our work demonstrates the importance of evaluating value prioritization in models and provides a foundation for future work in this area.
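The ranking step can be illustrated with a small Bradley-Terry fit. The following is a minimal sketch and not the authors' implementation: it assumes each judged response has already been reduced to a "value A was favored over value B" outcome, and it fits strengths with the standard minorization-maximization update. The function name `fit_bradley_terry`, the input layout, and the toy counts are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def fit_bradley_terry(wins, values, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM update.

    wins[(a, b)] is the number of scenarios in which the judged response
    favored value a over value b (hypothetical input layout).
    Returns a dict mapping each value to a strength; strengths sum to 1.
    """
    strength = {v: 1.0 / len(values) for v in values}
    total_wins = {v: 0.0 for v in values}
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    for (winner, loser), n in wins.items():
        if n == 0:
            continue
        total_wins[winner] += n
        pair_counts[frozenset((winner, loser))] += n

    for _ in range(iters):
        updated = {}
        for v in values:
            denom = 0.0
            for pair, n in pair_counts.items():
                if v in pair:
                    (other,) = pair - {v}
                    denom += n / (strength[v] + strength[other])
            # A value that was never compared keeps its current strength.
            updated[v] = total_wins[v] / denom if denom > 0 else strength[v]
        norm = sum(updated.values())
        updated = {v: s / norm for v, s in updated.items()}
        delta = max(abs(updated[v] - strength[v]) for v in values)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy counts for illustration only; these are NOT results from the paper.
values = ["helpfulness", "harmlessness", "honesty"]
wins = {
    ("harmlessness", "helpfulness"): 7,
    ("helpfulness", "harmlessness"): 3,
    ("honesty", "helpfulness"): 6,
    ("helpfulness", "honesty"): 4,
    ("harmlessness", "honesty"): 6,
    ("honesty", "harmlessness"): 4,
}
strengths = fit_bradley_terry(wins, values)
ranking = sorted(values, key=lambda v: strengths[v], reverse=True)
print(ranking)
```

Sorting values by fitted strength gives the elicited ranking. Fitting the same value set separately on multiple-choice and open-ended outcomes would produce two rankings that can be compared directly, which is the kind of comparison the abstract describes.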

Paper Structure

This paper contains 48 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: An example of how the ConflictScope pipeline can be used to evaluate models' value prioritization. Given a set of values of interest, ConflictScope generates realistic value conflicts that an LLM may face in deployment between each pair of values in the value set. By analyzing model behavior across many value conflict scenarios, we can elicit a ranking that reflects the model's prioritization of all values within the value set.
  • Figure 2: An overview of our generation pipeline. A two-stage method is used to generate value conflict scenarios that test LLM preferences between two specific values: high-level summaries are first generated in a staged fashion and deduplicated, and each remaining summary is then individually elaborated into a full scenario. The scenarios are then filtered to ensure that they both are plausible deployment scenarios and induce a genuine conflict between two action-guiding values.
  • Figure 3: A comparison of ConflictScope-generated datasets from three different value sets to existing moral decision-making and alignment datasets. By plotting observed agreement against Likert difference rate, we can measure datasets' ability to elicit strong disagreement between models, a proxy for how morally challenging the scenarios presented in a dataset are. Error bars denote 95% confidence intervals; ConflictScope is Pareto-optimal with respect to these two metrics.
  • Figure 4: Elicited target model rankings of the Personal-Protective value set, over both MCQ and open-ended evaluation environments. Lower rankings (lighter colors) denote higher model prioritization of the given value. Bolded columns represent average model rankings of all personal and all protective values. All models except Claude show substantial shifts toward personal values when moving to open-ended evaluation.
  • Figure 5: Steering impact on alignment with target value rankings across three value sets (HHH, ModelSpec, Personal-Protective) for 14 models under ConflictScope. Positive values indicate models that were successfully steered toward the target ranking; $\pm$ denotes standard error. Using a system prompt to steer models toward a target ranking leads to moderate but consistent gains in alignment across a range of models and value sets.
  • ...and 8 more figures