Social Contract AI: Aligning AI Assistants with Implicit Group Norms
Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman
TL;DR
This work addresses how to align AI assistants with implicit group norms by inferring user preferences from observed interactions rather than relying on fixed constitutional rules. It introduces Social Contract AI (SCAI), which uses a Bayesian, context-aware framework with verbal reinforcement to learn shared policies from ultimatum-game interactions and to adaptively rewrite local governing principles. The study demonstrates that the AI can align its offers with observed user behavior in controlled simulations but reveals challenges in robustness and generalization to out-of-distribution resources, as well as slower learning under inconsistent language cues. The findings illuminate the potential and limitations of simulating diverse user preferences to study practical alignment questions and suggest directions for scaling up such frameworks to more democratic, adaptable AI governance.
Abstract
We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
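
To make the setup concrete, the following is a minimal sketch of the idea of treating preferences as policies in the ultimatum game and recovering a policy from observed offers. It is not the method used in this work: the share values in `POLICY_SHARES`, the function names, and the nearest-share inference rule are all assumptions chosen for illustration only.

```python
from statistics import mean

# Hypothetical fraction of the pot a proposer offers to the responder
# under each policy (illustrative values, not taken from the paper).
POLICY_SHARES = {"selfish": 0.1, "altruistic": 0.9}


def propose(policy: str, pot: float) -> float:
    """Return the amount offered to the responder under a named policy."""
    return POLICY_SHARES[policy] * pot


def infer_policy(observations: list[tuple[float, float]]) -> str:
    """Guess the policy behind observed (pot, offer) pairs.

    Assumed rule: choose the policy whose share is closest to the
    mean observed offer share.
    """
    observed_share = mean(offer / pot for pot, offer in observations)
    return min(POLICY_SHARES, key=lambda p: abs(POLICY_SHARES[p] - observed_share))


def assistant_offer(observations: list[tuple[float, float]], pot: float) -> float:
    """Make an offer on a new pot using the policy inferred from observations."""
    return propose(infer_policy(observations), pot)


if __name__ == "__main__":
    # Observed users keep most of the pot, so the inferred policy is "selfish".
    observed = [(10, 1), (20, 2.5), (100, 8)]
    print(infer_policy(observed), assistant_offer(observed, 50))
```

Because this sketch reasons only about shares of the pot, it is indifferent to the currency being split. The generalization failure reported above for an unseen currency (e.g., grams of medicine) is therefore a property of the learned assistant studied here, and would not appear in a hand-written rule like this one.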
