Emergent Alignment via Competition
Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi
TL;DR
This work shows that if Alice's downstream utility can be approximated as a nonnegative combination of multiple misaligned AI agents' utilities (the market-alignment condition), then strategic competition among these agents yields outcomes close to those achievable with a perfectly aligned model. It develops a multi-leader Bayesian persuasion framework, introduces the Identical Induced Distribution and $(\delta, C_B^*)$-Close conditions, and proves that, under various learning and rationality assumptions, Alice's utility in equilibrium approaches the optimal benchmark. The paper also provides robust, assumption-free guarantees via a Best-AI Selection game, and supports the core idea with extensive experiments across ethics, movie ratings, and polling data, demonstrating that convex hull-based aggregation of diverse agents outperforms any individual model and simple averages. Collectively, the results suggest a practical path to emergent alignment through marketplace diversity and strategic interaction, with broad implications for AI safety and policy design.
Abstract
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which are individually well-aligned. Our key insight is that when the user's utility lies approximately within the convex hull of the agents' utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two sets of experiments.
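As an informal illustration of the convex hull condition, the following Python sketch measures how far a user's utility vector lies from the convex hull of several agents' utility vectors. It is not taken from the paper: the function name `hull_distance`, the representation of utilities as vectors over a finite outcome set, the Euclidean norm, and the Frank-Wolfe solver are all assumptions made for this example, and the paper's formal condition may be stated differently.

```python
import math


def hull_distance(agent_utils, user_util, iters=2000):
    """Approximate the Euclidean distance from user_util to the convex hull
    of the rows of agent_utils, using Frank-Wolfe with exact line search.

    agent_utils: list of utility vectors, one per agent (hypothetical data).
    user_util:   the user's utility vector over the same outcomes.
    Returns (weights, distance), where weights are nonnegative and sum to 1.
    """
    n, m = len(agent_utils), len(user_util)
    w = [1.0 / n] * n  # start from the uniform mixture of agents

    def mixture(weights):
        return [sum(weights[i] * agent_utils[i][k] for i in range(n))
                for k in range(m)]

    for _ in range(iters):
        mix = mixture(w)
        resid = [mix[k] - user_util[k] for k in range(m)]
        # Gradient of 0.5 * ||resid||^2 with respect to w_i is <u_i, resid>.
        grads = [sum(agent_utils[i][k] * resid[k] for k in range(m))
                 for i in range(n)]
        j = min(range(n), key=grads.__getitem__)  # best vertex to move toward
        d = [agent_utils[j][k] - mix[k] for k in range(m)]
        dd = sum(x * x for x in d)
        if dd == 0.0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        step = -sum(resid[k] * d[k] for k in range(m)) / dd
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no descent direction remains
        w = [(1.0 - step) * wi for wi in w]
        w[j] += step

    mix = mixture(w)
    dist = math.sqrt(sum((mix[k] - user_util[k]) ** 2 for k in range(m)))
    return w, dist


# Three agents, each caring about a single outcome (made-up numbers).
agents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
user = [0.5, 0.3, 0.2]
weights, gap = hull_distance(agents, user)
print(weights, gap)
```

In this toy instance no single agent matches the user, yet the user's utility is a mixture of the three, so the computed gap is close to zero. A user such as `[1.0, 1.0]` facing agents `[1.0, 0.0]` and `[0.0, 1.0]` lies outside the hull, and the gap stays bounded away from zero.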
