Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis

Aran Nayebi

TL;DR

This work formalizes AI value alignment as a multi-objective consensus problem among $N$ agents over $M$ tasks using the $\langle M,N,\varepsilon,\delta\rangle$-agreement framework, without assuming a common prior. It establishes fundamental limits via information-theoretic lower bounds that scale with $M$, $N$, and state-space size $D$, demonstrating a No-Free-Lunch principle that full human-value encoding incurs intrinsic overheads. It complements these impossibility results with constructive upper bounds and explicit algorithms for both unbounded and bounded rationality, including discretized-message and sampling-tree variants, showing how convergence scales with $M$, $N$, $D$, $\varepsilon$, and $\delta$. The paper also analyzes practical implications under bounded rationality, revealing exponential-time barriers unless problem structure is exploited or objectives are compressed, and it presents principled guidance for scalable, safety-focused human–AI collaboration. Overall, the results delineate a trade-off frontier for alignment: reducing task breadth and exploiting structure are essential for tractable, scalable oversight in complex state spaces.

Abstract

We formalize AI alignment as a multi-objective optimization problem called $\langle M,N,\varepsilon,\delta\rangle$-agreement, in which a set of $N$ agents (including humans) must reach approximate ($\varepsilon$) agreement across $M$ candidate objectives, with probability at least $1-\delta$. Analyzing communication complexity, we prove an information-theoretic lower bound showing that once either $M$ or $N$ is large enough, no amount of computational power or rationality can avoid intrinsic alignment overheads. This establishes rigorous limits to alignment *itself*, not merely to particular methods, clarifying a "No-Free-Lunch" principle: encoding "all human values" is inherently intractable and must be managed through consensus-driven reduction or prioritization of objectives. Complementing this impossibility result, we construct explicit algorithms as achievability certificates for alignment under both unbounded and bounded rationality with noisy communication. Even in these best-case regimes, our bounded-agent and sampling analysis shows that with large task spaces ($D$) and finite samples, *reward hacking is globally inevitable*: rare high-loss states are systematically under-covered, implying scalable oversight must target safety-critical slices rather than uniform coverage. Together, these results identify fundamental complexity barriers -- tasks ($M$), agents ($N$), and state-space size ($D$) -- and offer principles for more scalable human-AI collaboration.
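
The under-coverage claim can be illustrated with a toy calculation. The sketch below is not the paper's sampling-tree analysis; it assumes plain i.i.d. sampling and uses the elementary fact that a state with probability mass $p$ is missed by all $n$ samples with probability $(1-p)^n$. The function names `miss_probability` and `samples_needed` are hypothetical and chosen for illustration only.

```python
import math


def miss_probability(p: float, n: int) -> float:
    """Probability that a state of mass p never appears in n i.i.d. samples.

    Assumption (not from the paper): samples are independent and each one
    hits the state with probability exactly p.
    """
    return (1.0 - p) ** n


def samples_needed(p: float, delta: float) -> int:
    """Smallest n such that (1 - p)^n <= delta, i.e. the state is seen
    at least once with probability at least 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))


if __name__ == "__main__":
    D = 10**6          # hypothetical state-space size
    p = 1.0 / D        # mass of one rare state under a uniform distribution
    n = 10**4          # hypothetical finite sample budget
    print(f"miss probability with n={n}: {miss_probability(p, n):.4f}")
    print(f"samples needed for delta=0.05: {samples_needed(p, 0.05)}")
```

Under these assumptions, a sample budget far below $D$ leaves any single rare state almost certainly unobserved, and the number of samples needed to see it with confidence $1-\delta$ grows roughly like $D \ln(1/\delta)$. This is the intuition behind targeting safety-critical slices rather than uniform coverage.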
