Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

TL;DR

This work formalizes the Alignment Trilemma for RLHF, arguing that no procedure can simultaneously achieve $\varepsilon$-representativeness, polynomial tractability, and $\delta$-robustness for large, diverse populations. It derives a lower bound of $\Omega(2^{d_{\text{context}}})$ operations for satisfying representativeness and robustness together, and analyzes how current RLHF practice trades representativeness for tractability and partial robustness through limited annotator pools and KL penalties. The authors connect these limits to pathologies such as preference collapse, bias amplification, and sycophancy, and offer concrete relaxation strategies (e.g., modular value architectures, active disagreement, scoped adversaries) for navigating the trade-offs. They argue that practical alignment requires principled design choices, documented thresholds, and long-horizon research into algorithms that reduce the exponential dependence on context dimension, with broad implications for fairness, safety, and governance in global AI deployment.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) $\varepsilon$-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) $\delta$-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness ($\varepsilon \le 0.01$) and robustness ($\delta \le 0.001$) for global-scale populations requires $\Omega(2^{d_{\text{context}}})$ operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only $10^3$--$10^4$ samples from homogeneous annotator pools while $10^7$--$10^8$ samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
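
To make the scale of the two quantitative claims above concrete, the following is a minimal arithmetic sketch, not the paper's analysis. It treats the stated lower bound as the bare growth term 2^d (the constants hidden by the Omega notation are ignored), compares it with an arbitrary cubic polynomial chosen only for illustration, and computes the ratio between the sample counts the abstract says are needed and those it says are collected. All function names and the choice of a cubic are assumptions introduced here.

```python
# Illustrative arithmetic only. The abstract states an Omega(2^{d_context})
# lower bound; the hidden constants are unknown, so this compares growth
# rates rather than real operation counts.

def exponential_ops(d_context: int) -> int:
    """Bare growth term 2^d taken from the stated lower bound."""
    return 2 ** d_context


def polynomial_ops(d_context: int, degree: int = 3) -> int:
    """A hypothetical polynomial budget d^degree (degree is an assumption)."""
    return d_context ** degree


def crossover_dimension(degree: int = 3, start: int = 2, limit: int = 1000) -> int:
    """Smallest d >= start at which 2^d exceeds d^degree."""
    for d in range(start, limit + 1):
        if exponential_ops(d) > polynomial_ops(d, degree):
            return d
    raise ValueError("no crossover found below limit")


def sample_gap_range() -> tuple[int, int]:
    """Ratio of samples needed to samples collected, using the abstract's ranges.

    Collected: 10^3 to 10^4. Needed: 10^7 to 10^8.
    Returns (smallest ratio, largest ratio).
    """
    collected_low, collected_high = 10 ** 3, 10 ** 4
    needed_low, needed_high = 10 ** 7, 10 ** 8
    return needed_low // collected_high, needed_high // collected_low


if __name__ == "__main__":
    for d in (10, 20, 40):
        print(d, polynomial_ops(d), exponential_ops(d))
    print("crossover for cubic budget:", crossover_dimension(3))
    print("sample gap (min, max):", sample_gap_range())
```

Under these assumptions, the exponential term overtakes the cubic budget at a small context dimension and the gap then widens rapidly, while the shortfall in annotation samples is three to five orders of magnitude.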

Paper Structure

This paper contains 26 sections, 7 equations, 1 figure.

Figures (1)

  • Figure 1: Current Alignment Paradigm

Theorems & Definitions (3)

  • Definition 1: ε-Representativeness
  • Definition 2: Polynomial Tractability
  • Definition 3: δ-Robustness