The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie
TL;DR
This survey defines the superalignment problem as the need to scale supervision and governance for artificial superintelligence (ASI). It formalizes a taxonomy of AI types (ANI, AGI, ASI) and introduces scalable oversight concepts, including a formal criterion for alignment under expensive evaluation signals. The core of the paper analyzes four scalable oversight paradigms—Weak-to-Strong Generalization, Debate, Reinforcement Learning from AI Feedback, and Sandwiching—covering definitions, formalizations, enhancements, and applications to reasoning and vision. It also discusses key challenges (signal scalability, adversarial behavior, expert dependence, bias amplification) and proposes directions such as data diversity, iterative teacher-student training, and search-based methods for advancing safe ASI development. The work offers a structured framework for researchers and policymakers to evaluate and improve scalable oversight techniques, bridging current methods with long-term governance goals.
Abstract
The emergence of large language models (LLMs) has raised the possibility of Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. However, existing alignment paradigms struggle to guide such advanced AI systems. Superalignment, the alignment of AI systems with human values and safety requirements at superhuman levels of capability, aims to address two primary goals: scalability in supervision to provide high-quality guidance signals, and robust governance to ensure alignment with human values. In this survey, we examine scalable oversight methods and potential solutions for superalignment. Specifically, we explore the concept of ASI, the challenges it poses, and the limitations of current alignment paradigms in addressing the superalignment problem. We then review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of ASI systems. By comprehensively reviewing the current literature, we aim to provide a systematic introduction to existing methods, analyze their strengths and limitations, and discuss potential future directions.
