The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

HyunJin Kim, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie

TL;DR

This survey defines the superalignment problem as the need to scale supervision and governance for artificial superintelligence (ASI). It formalizes a taxonomy of AI types (ANI, AGI, ASI) and introduces scalable oversight concepts, including a formal criterion for alignment under expensive evaluation signals. The core of the paper analyzes four scalable oversight paradigms—Weak-to-Strong Generalization, Debate, Reinforcement Learning from AI Feedback, and Sandwiching—covering definitions, formalizations, enhancements, and applications to reasoning and vision. It also discusses key challenges (signal scalability, adversarial behavior, expert dependence, bias amplification) and proposes directions such as data diversity, iterative teacher-student training, and search-based methods for advancing safe ASI development. The work offers a structured framework for researchers and policymakers to evaluate and improve scalable oversight techniques, bridging current methods with long-term governance goals.

Abstract

The emergence of large language models (LLMs) has raised the possibility of Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. However, existing alignment paradigms struggle to guide such advanced AI systems. Superalignment, the alignment of AI systems with human values and safety requirements at superhuman levels of capability, aims to address two primary goals: scalability in supervision to provide high-quality guidance signals, and robust governance to ensure alignment with human values. In this survey, we examine scalable oversight methods and potential solutions for superalignment. Specifically, we explore the concept of ASI, the challenges it poses, and the limitations of current alignment paradigms in addressing the superalignment problem. Then we review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of ASI systems. By comprehensively reviewing the current literature, our goal is to provide a systematic introduction to existing methods, analyze their strengths and limitations, and discuss potential future directions.

Paper Structure

This paper contains 52 sections, 11 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Challenges from the perspectives of supervision and governance. While the supervision perspective focuses on providing high-quality guidance signals to enhance system competence, the governance perspective emphasizes aligning the behavior of advanced AI with human values to prevent harmful outcomes.
  • Figure 2: Weak-to-strong generalization technique. A weak AI system is first trained using instructions and labeled responses. It is then used to generate pseudo-responses for training a strong AI system. Although these pseudo-responses may be noisy, the strong AI system's generalization capability (red border) is the source of scalable oversight, allowing it to surpass the performance of the weak AI system.
  • Figure 3: Scalable oversight techniques, categorized by key techniques, core concept & analysis, enhancement, evaluation and application.
  • Figure 4: Debate technique. Two AI systems engage in an adversarial dialogue aimed at convincing a judge of the correctness of their respective arguments. Given a question $Q$, each AI system (0 and 1) presents its answer ($A_0$, $A_1$) along with supporting statements ($S$). The judge evaluates the dialogue and selects the most convincing argument. Scalable oversight is achieved through this debate process (red box), as the judge can base their decision on the dialogue rather than having to derive an answer from scratch.
  • Figure 5: The RLAIF technique replaces human feedback in RLHF with AI-generated critiques provided by an AI feedback system. This approach reduces dependence on human annotations, which is where its scalable oversight is realized (red border). An AI feedback system evaluates responses and trains a reward model that guides reinforcement learning. The AI system is then optimized through policy training to maximize alignment with the AI-generated feedback.
  • ...and 1 more figures
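
The debate setup described in the Figure 4 caption can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the survey's implementation: the names run_debate, debater_0, debater_1, and checking_judge are hypothetical, the debaters return fixed strings instead of calling a model, and the toy judge only rewards statements containing a factorization it can verify. It shows the point the caption makes: the judge checks the supporting statements ($S$) rather than deriving the answer to $Q$ from scratch.

```python
import re
from typing import List, Tuple

Turn = Tuple[int, str]  # (debater index, supporting statement S)


def run_debate(question, answers, debaters, judge, rounds=1):
    """Alternate statements between the debaters, then ask the judge for a winner."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        for idx, debater in enumerate(debaters):
            # Each debater sees a copy of the dialogue so far.
            transcript.append((idx, debater(question, list(transcript))))
    return judge(question, answers, transcript), transcript


# Hypothetical debaters: stand-ins for AI systems 0 and 1.
def debater_0(question, transcript):
    return "91 is odd and is not divisible by 3 or 5."


def debater_1(question, transcript):
    return "91 = 7 * 13"


def checking_judge(question, answers, transcript):
    """Toy judge (an assumption): score one point per verifiable 'n = a * b' claim."""
    pattern = re.compile(r"(\d+) = (\d+) \* (\d+)")
    scores = [0, 0]
    for idx, statement in transcript:
        for n, a, b in pattern.findall(statement):
            if int(n) == int(a) * int(b):
                scores[idx] += 1
    # Ties default to debater 0.
    return 0 if scores[0] >= scores[1] else 1


question = "Is 91 prime?"
answers = ("yes", "no")  # A_0 and A_1
winner, transcript = run_debate(
    question, answers, (debater_0, debater_1), checking_judge
)
print(f"Judge selects A_{winner}: {answers[winner]}")
```

The judge here does cheap verification only (one multiplication), while finding the factorization is left to the debaters. A realistic judge would be a human or a weaker model, and nothing in this sketch addresses the adversarial-behavior challenge named in the TL;DR.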