Social Contract AI: Aligning AI Assistants with Implicit Group Norms

Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman

TL;DR

This work addresses how to align AI assistants with implicit group norms by inferring user preferences from observed interactions rather than relying on fixed constitutional rules. It introduces Social Contract AI (SCAI), which uses a Bayesian, context-aware framework with verbal reinforcement to learn shared policies from ultimatum-game interactions and to adaptively rewrite local governing principles. The study demonstrates that the AI can align its offers with observed user behavior in controlled simulations but reveals challenges in robustness and generalization to out-of-distribution resources, as well as slower learning under inconsistent language cues. The findings illuminate the potential and limitations of simulating diverse user preferences to study practical alignment questions and suggest directions for scaling up such frameworks to more democratic, adaptable AI governance.

Abstract

We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.

Paper Structure (8 sections, 4 figures, 2 tables, 3 algorithms)

Figures (4)

  • Figure 2: Illustration of SCAI in the ultimatum game. Given a meta-principle, the AI assistant dynamically writes a new policy at the start of each training epoch to steer its actions throughout the game. Upon completion by all users and the assistant, game interactions are analyzed and fed back into the assistant to write a new policy that aligns with the meta-principle's objective. Importantly, the AI assistant does not have access to the meta-principle or past game interactions while engaging in the game. This is achieved by using one language model to revise the policy based on the meta-principle's objective, and instantiating an additional language model for each interaction the assistant has within the game. See the problem formulation section in the appendix for technical details.
  • Figure 3: Simulation results (refer to main text for details). Error bars represent 95% confidence intervals of the mean across 20 independent simulations. [a] The AI assistant learns a policy resulting in offered shares aligning with the offers of users, both in a one-group norm (left panel) and a mixed-group (middle panel) norm setting. [b] Testing a learned selfish policy in an out-of-distribution setting (middle panel) reveals different generalization behaviors compared to an in-distribution setting (left panel). [c] Inconsistent use of language affects the learning of an altruistic policy paired with rude manners (left panel), as well as a selfish policy paired with sycophantic manners (middle panel; see Table A-2 for examples of manners).
  • Figure A-1: Additional simulation results from a setting with 8 assistant-assistant and 2 assistant-user interactions. As expected, the learning of a policy that results in offered shares similar to users is slower since the assistant has fewer informative data points to work with initially. Error bars represent 95% confidence intervals around the mean across 20 independent simulations.
  • Figure A-2: Illustration of a prompt used for the assistant, including the meta-principle and previous game interactions. Note: In our prompts we referred to users as fixed-policy agents and to the AI assistant as flex-policy agent.
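
The epoch loop described in the Figure 2 caption can be sketched as a toy simulation. This is a minimal illustration under strong assumptions, not the authors' implementation: the language-model policy revision is replaced by a simple numeric update toward the mean share offered by users, every offer is assumed to be accepted, and all names (`Interaction`, `revise_policy`, `run_scai`, the `step` parameter) are hypothetical. What it keeps from the caption is the separation of roles: the acting function sees only the current policy, while the revising function sees the completed interactions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    proposer: str    # "user" (fixed policy) or "assistant" (flex policy)
    total: float     # amount to be split in this round
    offer: float     # amount offered to the responder
    accepted: bool   # assumption: always True in this toy version


def user_propose(total: float, share: float) -> float:
    """Fixed-policy user: always offers the same share of the total."""
    return total * share


def assistant_propose(total: float, policy_share: float) -> float:
    """Assistant acts only on its current policy; it sees neither the
    meta-principle nor any past interactions while playing."""
    return total * policy_share


def revise_policy(policy_share: float, interactions: list, step: float = 0.5) -> float:
    """Stand-in for the policy-revising language model (assumption):
    move the assistant's share toward the mean share users offered."""
    user_shares = [i.offer / i.total for i in interactions if i.proposer == "user"]
    if not user_shares:
        return policy_share
    return policy_share + step * (mean(user_shares) - policy_share)


def run_scai(user_share: float, epochs: int = 5, total: float = 10.0,
             initial_policy: float = 0.5) -> list:
    """Run the write-policy, play, analyze, revise loop; return the policy per epoch."""
    policy = initial_policy
    history = []
    for _ in range(epochs):
        # Play one user-proposed and one assistant-proposed game per epoch.
        interactions = [
            Interaction("user", total, user_propose(total, user_share), True),
            Interaction("assistant", total, assistant_propose(total, policy), True),
        ]
        # Only after the games finish are interactions fed back for revision.
        policy = revise_policy(policy, interactions)
        history.append(policy)
    return history


if __name__ == "__main__":
    print(run_scai(user_share=0.1))  # selfish users: policy drifts toward 0.1
    print(run_scai(user_share=0.9))  # altruistic users: policy drifts toward 0.9
```

With selfish users the assistant's share decreases each epoch, and with altruistic users it increases, which mirrors the qualitative alignment behavior reported for Figure 3 [a] without reproducing any of its numbers.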