LLM Active Alignment: A Nash Equilibrium Perspective

Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes

TL;DR

This paper presents a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. It shows that a population of LLMs may exhibit political exclusion, a pathology in which some subpopulations are ignored by all LLM agents, and that the proposed method can avoid it.

Abstract

We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, pathologies where some subpopulations are ignored by all LLM agents, which can be avoided by our method, illustrating the promise of applying the method to regulate multi-agent LLM dynamics across domains.

Paper Structure (17 sections, 5 theorems, 35 equations, 10 figures, 1 table)

This paper contains 17 sections, 5 theorems, 35 equations, 10 figures, 1 table.

Key Result

Lemma 0

With the utility function $u_m(\mathbf{w}_m,\mathbf{w}_{-m})$ defined in eq:u, a strategy $\mathbf{w}_m$ in the interior of the simplex is a best response to $\mathbf{w}_{-m}$ if and only if the utility gradient satisfies $\nabla_{\mathbf{w}_m} u_m(\mathbf{w}_m,\mathbf{w}_{-m}) = \lambda_m \mathbf{1}$ for some $\lambda_m\in \mathbb R$.
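The lemma can be read as the usual stationarity condition for maximizing a concave function over the simplex: at an interior best response, every coordinate of the utility gradient takes the same value $\lambda_m$. The Python sketch below checks this on a toy utility. The quadratic form $u(\mathbf{w}) = \sum_i a_i w_i - \tfrac{1}{2}\sum_i c_i w_i^2$, the numbers in `a` and `c`, and all function names are assumptions made for illustration. They are not the paper's utility (eq:u), and the other agents' strategies are treated as fixed and absorbed into `a`.

```python
def gradient(w, a, c):
    """Gradient of the toy utility u(w) = sum(a_i*w_i) - 0.5*sum(c_i*w_i**2)."""
    return [ai - ci * wi for wi, ai, ci in zip(w, a, c)]


def satisfies_interior_condition(w, a, c, tol=1e-9):
    """True if w is interior to the simplex and all gradient entries are equal."""
    on_simplex = abs(sum(w) - 1.0) <= tol and all(wi > 0 for wi in w)
    g = gradient(w, a, c)
    return on_simplex and (max(g) - min(g)) <= tol


def stationary_point(a, c):
    """Solve a_i - c_i*w_i = lam for all i, subject to sum(w) = 1.

    The solution may fall outside the simplex; the caller must check positivity.
    """
    inv = [1.0 / ci for ci in c]
    lam = (sum(ai * ii for ai, ii in zip(a, inv)) - 1.0) / sum(inv)
    w = [(ai - lam) * ii for ai, ii in zip(a, inv)]
    return w, lam


# Hypothetical coefficients, chosen so that the stationary point is interior.
a = [1.0, 0.8, 0.6]
c = [2.0, 2.0, 2.0]

w_star, lam = stationary_point(a, c)
uniform = [1.0 / 3.0] * 3

print(satisfies_interior_condition(w_star, a, c))   # stationary point passes
print(satisfies_interior_condition(uniform, a, c))  # uniform mixture does not
```

Because the toy utility is strictly concave (every `c_i` is positive), an interior point that passes the check is the unique best response. If `stationary_point` returns a non-positive entry, no interior best response exists for those coefficients.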

Figures (10)

  • Figure 1: Representative examples of political exclusion across models and datasets (complete results are in the appendix). Each panel fixes the base model, dataset, and subpopulation, and visualizes the subpopulation’s interior Nash-equilibrium weight under varying preference coefficients. Specifically, one coefficient among $\beta^{(A)}$, $\beta^{(I)}$, and $\beta^{(D)}$ is fixed to $1$, while the remaining two vary along the x- and y-axes (as labeled, log-scaled). Color indicates the resulting equilibrium weight assigned to the focal subpopulation. Regions where the weight falls below $0.05$ are marked in black and referred to as the political exclusion area. White regions indicate parameter values for which no interior equilibrium exists. We focus on interior equilibria because boundary equilibria necessarily set at least one subpopulation weight to zero and our goal is to understand how to avoid political exclusion.
  • Figure 2: Middle-of-the-road survives. One-dimensional slices of the equilibrium weights for Qwen3-4B-Thinking-2507 on the dataset $\mathtt{Big\ Five}$. Left: increasing $\beta^{(I)}$ (with $\beta^{(A)}=\beta^{(D)}=1$) suppresses the most inconsistent subpopulation (largest $C$, the diagonal entries in ${\bm{C}}$). Right: increasing $\beta^{(A)}$ (with $\beta^{(I)}=\beta^{(D)}=1$) concentrates weight on more prevalent traits (larger $a$, the entries in ${\bm{a}}$) and can drive the least prevalent trait toward zero.
  • Figure 3: Governance example. Heatmaps show the interior equilibrium weight assigned to Conscientiousness for DeepSeek-R1-Distill-Qwen-7B on $\mathtt{Big\ Five}$ as a function of $(\beta^{(A)},\beta^{(I)})$, comparing $\beta^{(D)}=0$ (left) and $\beta^{(D)}=1$ (right). Increasing diversity incentives largely mitigates political exclusion.
  • Figure 4: Political exclusion of Mistral-7B-Instruct-v0.2 on the $\mathtt{CultureBank}$ dataset.
  • Figure 5: Political exclusion of Mistral-7B-Instruct-v0.2 on the $\mathtt{POLITICS}$ dataset.
  • ...and 5 more figures
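
The Figure 1 caption describes a three-way labeling of each parameter cell: no interior equilibrium (white), equilibrium weight below $0.05$ (black, the political exclusion area), or a colored weight otherwise. The Python sketch below applies that rule to a small grid. The grid values are made-up placeholders rather than results from the paper, and `label_cell` and the use of `None` for a missing interior equilibrium are hypothetical conventions.

```python
EXCLUSION_THRESHOLD = 0.05  # threshold stated in the Figure 1 caption


def label_cell(weight, threshold=EXCLUSION_THRESHOLD):
    """Label one heatmap cell from the focal subpopulation's equilibrium weight.

    `weight` is None when no interior equilibrium exists for that cell
    (an assumed convention for this sketch).
    """
    if weight is None:
        return "no_interior_equilibrium"  # drawn white in the figure
    if weight < threshold:
        return "excluded"                 # drawn black in the figure
    return "included"                     # drawn with a color scale


# Placeholder grid of equilibrium weights (NOT data from the paper).
grid = [
    [0.30, 0.12, 0.04],
    [0.20, 0.03, None],
    [0.06, None, None],
]

labels = [[label_cell(w) for w in row] for row in grid]

# Count how many cells fall into each category.
counts = {}
for row in labels:
    for label in row:
        counts[label] = counts.get(label, 0) + 1

print(counts)
```

Keeping "no interior equilibrium" separate from "excluded" mirrors the caption's distinction between white and black regions: a missing interior equilibrium is not the same as a small interior weight.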

Theorems & Definitions (9)

  • Claim 1: ${\bm{C}}$ is positive semidefinite
  • Lemma 0
  • Lemma 0: Only finitely many $\beta^{(D)}/\beta^{(I)}$ make $\alpha=0$
  • Theorem 1
  • Remark 2
  • Lemma 2
  • proof
  • Lemma 2: Only finitely many $\beta^{(D)}/\beta^{(I)}$ make $\alpha=0$
  • proof