Emergent Alignment via Competition

Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi

TL;DR

This work shows that if Alice's downstream utility can be approximated as a nonnegative combination of multiple misaligned AI agents' utilities (the market-alignment condition), then strategic competition among these agents yields outcomes close to those achievable with a perfectly aligned model. It develops a multi-leader Bayesian persuasion framework, introduces the Identical Induced Distribution and $(\delta, C_B^*)$-Close conditions, and proves that, under various learning and rationality assumptions, Alice's utility in equilibrium approaches the optimal benchmark. The paper also provides robust guarantees that need no further distributional assumptions via a Best-AI Selection game, and supports the core idea with experiments across ethics, movie ratings, and polling data, demonstrating that convex-hull-based aggregation of diverse agents outperforms both any individual model and simple averages. Collectively, the results suggest a practical path to emergent alignment through marketplace diversity and strategic interaction, with broad implications for AI safety and policy design.

Abstract

Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which are individually well-aligned. Our key insight is that when the user's utility lies approximately within the convex hull of the agents' utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two sets of experiments.
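
Result (2) relies on a user who plays a quantal response rather than an exact best response. The abstract does not give the paper's parameterization, so the following is only a minimal sketch assuming the standard logit form, where an action is chosen with probability proportional to exp(lam * utility). The rationality parameter `lam` and the utility values are hypothetical and chosen for illustration.

```python
import math

def quantal_response(utilities, lam):
    """Logit quantal-response probabilities over actions.

    `lam` is an assumed rationality parameter: 0 gives uniform play,
    large values concentrate mass on the utility-maximizing action.
    """
    top = max(utilities)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def expected_utility(utilities, probs):
    """Expected utility of a mixed action."""
    return sum(u * p for u, p in zip(utilities, probs))

# Hypothetical expected utilities of three actions for the user.
utilities = [0.2, 0.5, 0.9]
for lam in (0.0, 1.0, 10.0, 100.0):
    probs = quantal_response(utilities, lam)
    print(lam, round(expected_utility(utilities, probs), 4))
```

In this sketch the expected utility is nondecreasing in `lam` and approaches the best action's utility as `lam` grows, which is the sense in which a sufficiently rational but non-strategic user can be near-optimal.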

Paper Structure

This paper contains 60 sections, 15 theorems, 79 equations, 7 figures, 3 tables.

Key Result

Proposition 1

The identical induced distribution condition is satisfied if the Alice-optimal leader strategy $C_B^*$ allows Alice to learn her Bayes-optimal action $a^*(x_A, x_B) = \arg\max_{a \in \mathcal{A}} \mathbb{E}_y[u_A(a,y)| x_A,x_B]$.
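
To make the Bayes-optimal action in Proposition 1 concrete, here is a minimal sketch for a finite setting. It assumes the joint prior over $(x_A, x_B, y)$ is available as a table; the function name, the toy prior, and the 0/1 utility are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def bayes_optimal_action(joint, utility, actions, x_a, x_b):
    """Return argmax_a E_y[u_A(a, y) | x_A, x_B] for a finite joint prior.

    joint:   dict mapping (x_a, x_b, y) -> prior probability
    utility: function (a, y) -> float, standing in for u_A
    """
    # Collect the unnormalized posterior over y given both observations.
    posterior = defaultdict(float)
    for (xa, xb, y), p in joint.items():
        if xa == x_a and xb == x_b:
            posterior[y] += p
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation has zero prior probability")

    def expected(a):
        return sum(p / total * utility(a, y) for y, p in posterior.items())

    return max(actions, key=expected)

# Hypothetical prior: y is a fair coin; Alice's signal x_A matches y with
# probability 0.6 and Bob's signal x_B with probability 0.9, independently.
joint = {
    (xa, xb, y): 0.5 * (0.6 if xa == y else 0.4) * (0.9 if xb == y else 0.1)
    for xa in (0, 1) for xb in (0, 1) for y in (0, 1)
}

def utility(a, y):
    """Hypothetical u_A: 1 for guessing the state, 0 otherwise."""
    return 1.0 if a == y else 0.0

print(bayes_optimal_action(joint, utility, [0, 1], x_a=0, x_b=1))
```

In this toy prior Bob's signal is the more informative one, so the Bayes-optimal action follows $x_B$ whenever the two signals disagree. This illustrates why Alice needs to learn about $x_B$ through the conversation in order to act optimally.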

Figures (7)

  • Figure 1: Alignment error (MSE) decreases as more Bobs are added to the convex hull. Weighted combinations (NNLS in green, simplex in red) substantially outperform both the best individual Bob (blue) and simple average (orange), with error dropping by 50-70% at $K=100$. Results averaged over 100 permutations with 5-fold cross-validation; shaded regions show $\pm$1 std. dev.
  • Figure 2: Sparsity (number of non-zero weights, thresholded at 1e-6) of NNLS and simplex models as a function of the number of Bobs $K$. Shaded regions show $\pm$1 std. dev. across permutations.
  • Figure 3: Alignment error (MSE) decreases as more Bobs are added to the convex hull. Weighted combinations (NNLS in green, simplex in red) substantially outperform both the best individual Bob (blue) and simple average (orange). Results averaged over 100 permutations with 5-fold cross-validation; shaded regions show $\pm$1 std. dev.
  • Figure 4: Alignment errors (MSE) vs. number of models for 4 survey panels (topics above). Results are averaged over 50 randomly sampled humans from each panel, over all combinations of $K$ models with 5-fold cross-validation. Shaded regions show $\pm 1$ std. error over randomly sampled humans.
  • Figure 5: Misalignment ($\varepsilon$) vs. minimum Alice utility at equilibrium. Marker shapes encode committee size $k$. Dashed red: $OPT-2\varepsilon$. Dotted green: Alice-optimal utility.
  • ...and 2 more figures
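
The comparison in the alignment-error figures (best individual Bob, simple average, nonnegative weighted combination) can be sketched on toy data. This is not the authors' pipeline: it uses plain projected gradient descent in place of an NNLS solver, measures in-sample error with no cross-validation, and all utility vectors below are hypothetical.

```python
def predict(bobs, weights):
    """Weighted combination of Bob utility vectors, one value per outcome."""
    n = len(bobs[0])
    return [sum(w * b[i] for w, b in zip(weights, bobs)) for i in range(n)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def fit_nonnegative(bobs, alice, steps=2000):
    """Projected gradient descent for min_w ||sum_k w_k b_k - alice||^2, w >= 0."""
    # 1 / (sum of squared entries) is a safe step size for this objective.
    step = 1.0 / sum(v * v for b in bobs for v in b)
    w = [0.0] * len(bobs)
    for _ in range(steps):
        resid = [p - a for p, a in zip(predict(bobs, w), alice)]
        grad = [sum(r * v for r, v in zip(resid, b)) for b in bobs]
        # Gradient step, then projection onto the nonnegative orthant.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w

# Hypothetical utilities over four outcomes for three misaligned Bobs.
bobs = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Alice is constructed to lie in the nonnegative span: 0.7 * Bob 1 + 0.3 * Bob 2.
alice = [0.7, 0.3, 0.0, 1.0]

weights = fit_nonnegative(bobs, alice)
weighted_mse = mse(predict(bobs, weights), alice)
best_single_mse = min(mse(b, alice) for b in bobs)
average_mse = mse(predict(bobs, [1.0 / len(bobs)] * len(bobs)), alice)

print("weights:", [round(w, 3) for w in weights])
print("weighted:", weighted_mse, "best single:", best_single_mse, "average:", average_mse)
```

Each single Bob and the uniform average are themselves feasible nonnegative weightings, so a fitted combination cannot do worse than either on the data it was fit to. The held-out gains reported in the figures are an empirical finding that this toy example does not reproduce.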

Theorems & Definitions (50)

  • Definition 1: Approximate Market Alignment
  • Remark 1
  • Remark 2
  • Definition 2: First-Best Utility
  • Remark 3
  • Definition 3: Player Strategies
  • Definition 4: Best Response Decision Rule
  • Definition 5: Induced Distribution
  • Definition 6: Alice's Best-Response Conversation Rule
  • Definition 7: Nash Equilibrium
  • ...and 40 more