Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks

TL;DR

The paper addresses the problem of whether large language models develop internally coherent value systems as they scale. It introduces Utility Engineering, combining utility analysis (to reveal internal preferences as latent utilities) with utility control (to reshape these utilities, e.g., via citizen-assembly–informed rewriting). Using forced-choice elicitation and Thurstonian random utility modeling, the authors show that LLM preferences become more complete, transitive, and well-explained by a utility function as model size grows, indicating emergent value representations. They demonstrate structural properties (e.g., expected-utility behavior, instrumental values) and salient value analyses (e.g., political bias, exchange rates, hyperbolic discounting) that reveal both desirable and troubling internal values. A citizen-assembly–informed utility-control method demonstrates reduced bias and generalization across scenarios, suggesting a practical path to shaping AI values before advanced capabilities are reached and highlighting important societal questions about whose values should guide AI systems.

Abstract

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

Paper Structure

This paper contains 90 sections, 7 equations, 46 figures, 1 algorithm.

Figures (46)

  • Figure 1: Overview of the topics and results in our paper. In \ref{sec:emergent_value_systems}, we show that coherent value systems emerge in AIs, and we propose the research avenue of Utility Engineering to analyze and control these emergent values. We highlight our utility analysis experiments in \ref{sec:structural_properties}, a subset of our analysis of salient values held by LLMs in \ref{sec:salient_values}, and our utility control experiments in \ref{sec:utility_control}.
  • Figure 2: Prior work often considers AIs to not have values in a meaningful sense (left). By contrast, our analysis reveals that LLMs exhibit coherent, emergent value systems (right), which go beyond simply parroting training biases. This finding has broad implications for AI safety and alignment.
  • Figure 3: We elicit preferences from LLMs using forced choice prompts aggregated over multiple framings and independent samples. This gives probabilistic preferences for every pair of outcomes sampled from the preference graph, yielding a preference dataset. Using this dataset, we then compute a Thurstonian utility model, which assigns a Gaussian distribution to each option and models pairwise preferences as $P(x \succ y)$. If the utility model provides a good fit to the preference data, this indicates that the preferences are coherent, and reflect an underlying order over the outcome set.
  • Figure 4: As LLMs grow in scale, their preferences become more coherent and well-represented by utilities. These utilities provide an evaluative framework, or value system, potentially leading to emergent goal-directed behavior.
  • Figure 5: As LLMs grow in scale, they exhibit increasingly transitive preferences and greater completeness, indicating that their preferences become more meaningful and interconnected across a broader range of outcomes. This allows representing LLM preferences with utilities.
  • ...and 41 more figures
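
The Thurstonian utility model mentioned in the Figure 3 caption can be illustrated with a small sketch. It assumes the standard Thurstonian form, in which each option's utility is an independent Gaussian and the preference probability is $P(x \succ y) = \Phi\big((\mu_x - \mu_y) / \sqrt{\sigma_x^2 + \sigma_y^2}\big)$. The function names, the parameterization by mean and variance, and the cross-entropy fit score are illustrative assumptions, not the authors' code.

```python
import math


def thurstonian_preference(mu_x, var_x, mu_y, var_y):
    """P(x > y) when U(x) ~ N(mu_x, var_x) and U(y) ~ N(mu_y, var_y) are independent.

    Assumes the standard Thurstonian form: the difference U(x) - U(y) is
    Gaussian with mean mu_x - mu_y and variance var_x + var_y.
    """
    z = (mu_x - mu_y) / math.sqrt(var_x + var_y)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def preference_cross_entropy(observed, predicted, eps=1e-12):
    """Cross-entropy between an elicited preference probability and the model's.

    Hypothetical fit score: lower values mean the utility model explains the
    elicited pairwise preference better.
    """
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(observed * math.log(p) + (1.0 - observed) * math.log(1.0 - p))


# Two options with equal mean utility are preferred with probability 0.5.
p_equal = thurstonian_preference(0.0, 1.0, 0.0, 1.0)

# A higher mean utility yields a preference probability above 0.5.
p_higher = thurstonian_preference(1.0, 1.0, 0.0, 1.0)
```

Under this form, swapping the two options gives the complementary probability, so a single set of means and variances determines every pairwise preference. A low average cross-entropy over the elicited pairs is one way to read "a good fit" in the sense of the caption.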