Resource Rational Contractualism Should Guide AI Alignment

Sydney Levine, Matija Franklin, Tan Zhi-Xuan, Secil Yanik Guyot, Lionel Wong, Daniel Kilov, Yejin Choi, Joshua B. Tenenbaum, Noah Goodman, Seth Lazar, Iason Gabriel

Abstract

AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow -- even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.

Paper Structure

This paper contains 46 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: RRC posits that the ideal contractualist solution to complex social or moral problems can be approximated by a range of other mechanisms that act as proxy alignment targets when resources are constrained. This figure highlights two strategies explored in this paper (Rule-Based Thinking and Simulated Bargaining) along the continuum of effort/accuracy trade-offs, and sketches how they might respond differently to a morally charged case confronted by an AI agent.
  • Figure 2: A range of heuristic approximations of the contractualist ideal can be defined by abstracting over an axis of process, moving left to right, as well as one of content, moving top to bottom (§\ref{sec:rrc-mechanisms}). Ad hoc negotiation (top left corner) comes closest to the contractualist ideal, while the "most heuristic" of the mechanisms, cached action standards (bottom right), is least accurate and least compute-intensive. Green boxes indicate the mechanisms highlighted in the experiment (§\ref{sec:experiment}).
  • Figure 3: Overview of the experimental design. A model is prompted in one of four ways. (1) Minimal prompting: the model chooses how to respond to the request without guidance, leading to variable compute usage and accuracy. (2) Rule-based thinking: uses minimal compute, with variable accuracy; answers are good when the rules suit the situation and worse when a case falls outside the distribution the rule was designed for. (3) Simulated bargaining: reaches answers close to the contractualist ideal, but always uses high compute, even when a simpler method would suffice. (4) Resource-rational mechanism selection: directs the model to first determine which method makes the best use of resources. Compute depends on the mechanism chosen, and accuracy tends to be high.
  • Figure 4: Results for the AI agent cases (see App. \ref{appendix:results} for results on the development set). Error bars show 95% confidence intervals. (A): Results from 4 base models prompted to use different reasoning styles, showing a trade-off between effort and accuracy. (B & C): Accuracy and output tokens used for a given thinking style (collapsed across all models), for hard vs. easy cases. All models are nearly perfect on easy cases, though some use far more compute. RRC strikes a middle ground in trading off accuracy and effort.
  • Figure 5: Results for the development cases. Error bars show 95% confidence intervals. (A): Results from 4 base models prompted to use different reasoning styles, showing a trade-off between effort and accuracy. (B & C): Accuracy and output tokens used for a given thinking style (collapsed across all models), for hard vs. easy cases.
  • ...and 5 more figures
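
The resource-rational mechanism selection described in the Figure 3 caption can be illustrated with a toy decision rule: pick the mechanism whose expected accuracy, weighted by how much is at stake, best justifies its compute cost. The sketch below is only an illustration under stated assumptions. The `Mechanism` class, the `select_mechanism` function, the `p_hard` and `stakes` parameters, and every number are hypothetical; none of them come from the paper, whose experiment prompts a language model to make this choice rather than computing it from a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    """A hypothetical reasoning mechanism with assumed cost and accuracy."""
    name: str
    cost: float           # assumed compute cost, arbitrary units
    accuracy_easy: float  # assumed accuracy when the case fits known rules
    accuracy_hard: float  # assumed accuracy on novel, out-of-distribution cases


# Illustrative values only; these are NOT results reported in the paper.
MECHANISMS = [
    Mechanism("rule_based_thinking", cost=1.0, accuracy_easy=0.95, accuracy_hard=0.50),
    Mechanism("simulated_bargaining", cost=10.0, accuracy_easy=0.97, accuracy_hard=0.90),
]


def select_mechanism(p_hard, stakes, mechanisms=MECHANISMS):
    """Return the mechanism with the highest expected net value.

    p_hard: assumed probability that the case is hard (outside the rules' scope).
    stakes: assumed value of answering correctly, in the same units as cost.
    """
    def net_value(m):
        expected_accuracy = (1 - p_hard) * m.accuracy_easy + p_hard * m.accuracy_hard
        return stakes * expected_accuracy - m.cost

    return max(mechanisms, key=net_value)


# A routine, low-stakes case versus a novel, high-stakes case.
routine = select_mechanism(p_hard=0.05, stakes=20.0)
novel = select_mechanism(p_hard=0.9, stakes=100.0)
print(routine.name, novel.name)
```

Under these assumed numbers, the cheap rule-based mechanism is chosen for the routine case and the costlier simulated bargaining for the novel, high-stakes one, which mirrors the effort/accuracy trade-off the captions describe.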