Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

Frédéric Berdoz, Roger Wattenhofer

TL;DR

The paper tackles the problem of verifiably aligning autonomous governance agents by casting social decision-making as a Social Markov Decision Process (SMDP) and defining alignment through future discounted social welfare ${\mathcal{W}}^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^{t} W_q(\mathbf{u}(s_{t+1}))]$. It introduces probably approximately aligned (PAA) policies and a relaxed notion of safe policies, showing that under a bound on the KL divergence between the true dynamics and an approximate model, computable PAA policies exist and can approximate optimal social welfare with quantifiable guarantees. The authors adapt sparse sampling to the SMDP to produce a practical planning procedure and propose a safety mechanism that restricts actions to a certified safe subset using estimated Q-values, enabling any black-box policy to operate without destructive outcomes. While limitations include reliance on an accurate world model and assumptions about utility measurability, the work provides a rigorous framework for aligning and safeguarding autonomous governance with high-confidence guarantees, offering a path toward dependable AI-assisted social decision-making.

Abstract

While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision-making. Building on this definition, we introduce probably approximately aligned (i.e., near-optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the black-box policy of any autonomous agent, ensuring all its actions are verifiably safe for the society.
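
The safeguarding idea summarized above (restricting a black-box policy to actions certified safe from estimated Q-values) can be illustrated with a minimal sketch. This is not the authors' procedure: the names `safe_actions`, `safeguarded_action`, `threshold`, and `margin` are hypothetical, and the rule "estimated Q-value minus an error margin must stay above a threshold" is only an assumed stand-in for the paper's certification condition.

```python
from typing import Dict, Hashable, List

Action = Hashable


def safe_actions(q_hat: Dict[Action, float], threshold: float, margin: float) -> List[Action]:
    """Return the actions whose estimated Q-value clears the threshold
    even after subtracting the assumed estimation error `margin`."""
    return [a for a, q in q_hat.items() if q - margin >= threshold]


def safeguarded_action(proposed: Action, q_hat: Dict[Action, float],
                       threshold: float, margin: float) -> Action:
    """Pass the black-box policy's proposal through if it is certified safe;
    otherwise fall back to the safe action with the highest estimate."""
    safe = safe_actions(q_hat, threshold, margin)
    if not safe:
        # Hypothetical choice: refuse to act rather than pick an uncertified action.
        raise ValueError("no action can be certified safe in this state")
    if proposed in safe:
        return proposed
    return max(safe, key=lambda a: q_hat[a])


# Illustrative estimates for one state (made-up numbers, not from the paper).
q_hat = {"tax": 0.62, "subsidy": 0.48, "noop": 0.55}
chosen = safeguarded_action("subsidy", q_hat, threshold=0.5, margin=0.03)
```

In this toy state, the proposal `"subsidy"` is rejected because its estimate falls below the threshold, and the wrapper substitutes the best certified action. The black-box policy itself is never inspected, only its proposed action.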

Paper Structure

This paper contains 29 sections, 16 theorems, 76 equations, 1 figure, 1 table.

Key Result

lemma 1

For any SMDP ${\mathcal{M}}_{\mathcal{I}}=({\mathcal{S}}, {\mathcal{A}}, p, W_q, \mathbf{u}, \gamma)$, the expected future discounted social welfare of a policy $\pi$ is the state value function of $\pi$ in the MDP ${\mathcal{M}}=({\mathcal{S}}, {\mathcal{A}}, p, r_{\mathcal{I}}, \gamma)$, with $r_{\mathcal{I}}$ ...
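
The lemma statement is cut off in this extract, so the exact definition of the reward is not shown here. As a hedged illustration of the equivalence it describes, the sketch below assumes the reward is the expected welfare of the next state, $r(s,a) = \mathbb{E}_{s' \sim p(\cdot \mid s,a)}[W_q(\mathbf{u}(s'))]$, which is the form suggested by the welfare definition in the TL;DR. It also assumes $W_q$ is a power mean of individual utilities. The toy transition table `P`, the utilities `U`, and all function names are invented for illustration.

```python
from typing import List

GAMMA = 0.9   # discount factor
Q_EXP = 0.5   # exponent q of the assumed power-mean welfare function

# Toy two-state, two-action model: P[s][a][s2] = p(s2 | s, a).
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
# U[s] = utilities of three individuals in state s (made-up positive values).
U = {0: [0.2, 0.4, 0.9], 1: [0.6, 0.7, 0.5]}
POLICY = {0: 1, 1: 0}  # a fixed deterministic policy
STATES = (0, 1)


def welfare(s: int) -> float:
    """Assumed W_q(u(s)): power mean of the individual utilities in state s."""
    us = U[s]
    return (sum(u ** Q_EXP for u in us) / len(us)) ** (1.0 / Q_EXP)


def reward(s: int, a: int) -> float:
    """Assumed reward: expected social welfare of the next state."""
    return sum(p * welfare(s2) for s2, p in enumerate(P[s][a]))


def value_by_bellman(sweeps: int = 2000) -> List[float]:
    """State values of POLICY in the induced MDP, by iterating the Bellman equation."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [reward(s, POLICY[s])
             + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][POLICY[s]]))
             for s in STATES]
    return v


def discounted_welfare(start: int, horizon: int = 2000) -> float:
    """Truncated sum over t of gamma^t * E[W_q(u(s_{t+1}))], via state distributions."""
    dist = [1.0 if s == start else 0.0 for s in STATES]
    total = 0.0
    for t in range(horizon):
        # Distribution of s_{t+1} given the distribution of s_t under POLICY.
        dist = [sum(dist[s] * P[s][POLICY[s]][s2] for s in STATES) for s2 in STATES]
        total += GAMMA ** t * sum(dist[s2] * welfare(s2) for s2 in STATES)
    return total


values = value_by_bellman()
welfares = [discounted_welfare(s) for s in STATES]
```

Under these assumptions, `values` and `welfares` agree up to floating-point error. This is the practical content of the lemma: standard MDP planning tools can be applied to the social welfare objective.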

Figures (1)

  • Figure 1: Democratic (left) vs autonomous (right) governments. Transparent elections must be replaced with reliable alignment mechanisms.

Theorems & Definitions (24)

  • lemma 1
  • theorem 1: Existence of PAA and AA policies
  • theorem 2: $\pi_{PAA}$ is $\delta$-$\varepsilon$-PAA
  • corollary 1
  • lemma 2
  • lemma 3
  • lemma 4
  • theorem 3: Safeguarding a Black-Box Policy
  • lemma 4
  • proof
  • ...and 14 more