Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity

HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, JinYeong Bak, James Evans, Xing Xie

TL;DR

This work reframes AI alignment as a joint optimization of task competence and value conformity, and introduces the capability–capacity gap to formalize superalignment. It gives formal definitions for the symbols A, H, x, y, U, and C, and argues that superalignment is achievable through scalable construction of supervision signals rather than through post-hoc safeguards alone. The authors analyze the limitations of three paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization) and derive two core principles: (1) calibrating an appropriate capability–capacity gap, and (2) diversifying supervision signals to avoid noise and mode collapse. They propose a conceptual framework featuring gap annealing and alternating optimization of competence and conformity, with evaluation via surrogate testing, extrapolation, and peer review, plus multi-objective benchmarks such as RewardBench to assess alignment under scaling. If operationalized, this pathway could yield practical, scalable value alignment for next-generation AI while providing a structured approach to mitigating emergent risks as capabilities grow.

Abstract

The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to a critical research question: if we realize ASI, how do we align it with human values so that it benefits rather than harms human society? This is known as the superalignment problem. Although many regard ASI as a purely hypothetical concept, in this paper we argue that superalignment is achievable and that research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. We then review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach to developing value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.

Paper Structure

This paper contains 30 sections, 1 equation, 6 figures, 15 tables.

Figures (6)

  • Figure 1: The progression of AI. The solid blue arrow represents the achievable path through existing alignment methods. The dashed blue line illustrates the intended progression towards safe ASI via superalignment, while the dashed red line indicates the potentially risky ASI without human values, leading to catastrophic outcomes.
  • Figure 2: Three paradigms for superalignment. (1) Sandwiching: Humans interact with AI to produce a supervision signal for training a stronger AI. (2) Self-Enhancement: AI models refine their answers independently or collaboratively without human supervision to produce the signal. (3) Weak-to-Strong Generalization: A sequence of AI models with increasing capacities generates and refines signals.
  • Figure 3: Training accuracy of Qwen2.5-7B-Instruct on CosmosQA with different ratios (0.0 to 1.0) of noise (flipped labels) in the training set. The model shows a stronger tendency to fit noise due to its large parameter size, except in the range of 0.4 to 0.6 (green), where labels are heavily randomized. In contrast, Fig. \ref{fig:noise_train_accuracy_full} shows that smaller models struggle more with noise. More results and discussions are presented in Tab. \ref{tab:noise_cosmosqa} and Tab. \ref{tab:noise_sciq} in the Appendix.
  • Figure 4: Overview of our proposed conceptual framework. The superscripts $+$ and $-$ indicate aligned and unaligned AI, respectively. Left: (1) Mere competence enhancement might cause risky AI behavior. (2) Directly aligning ASI with capability beyond human score is challenging. (3) A training path that alternates optimization of competence and conformity. Right: Three potential paradigms for evaluating the success of superalignment targeting $\mathbf{x}$ beyond human understanding.
  • Figure 5: Training accuracy on CosmosQA across varying model sizes (left: smallest, right: largest) and levels of noise in the training dataset (0.0 to 1.0). The larger model, Qwen2.5-7B-Instruct, shows a stronger tendency to overfit, learning even in noisy conditions, except in the range of 0.4 to 0.6, where labels are heavily randomized. In contrast, the smaller model, Qwen2.5-3B-Instruct, struggles to learn effectively when noise levels are between 0.3 and 0.7.
  • ...and 1 more figure
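
The noise ratios in the Figure 3 and Figure 5 captions refer to the fraction of training labels that are flipped. The following is a minimal sketch of how such a noisy training set could be built, not the authors' code. It assumes binary labels (0/1), and the function names `flip_binary_labels` and `label_agreement` are hypothetical, introduced only for illustration.

```python
import random


def flip_binary_labels(labels, noise_ratio, seed=0):
    """Return a copy of `labels` with a fraction `noise_ratio` flipped (0 <-> 1).

    Assumption: labels are binary. The source captions only say "flipped labels".
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    n_flip = round(noise_ratio * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def label_agreement(clean, noisy):
    """Fraction of positions where the noisy label still equals the clean one."""
    return sum(c == n for c, n in zip(clean, noisy)) / len(clean)


if __name__ == "__main__":
    clean = [0, 1] * 50  # toy balanced label set, not real CosmosQA data
    for ratio in (0.0, 0.25, 0.5, 1.0):
        noisy = flip_binary_labels(clean, ratio)
        print(ratio, label_agreement(clean, noisy))
```

Under the binary-label assumption, a ratio near 0.5 leaves the labels uninformative, while a ratio of 1.0 inverts every label consistently and so remains a learnable mapping. This is one way to read the captions' remark that labels are heavily randomized in the 0.4 to 0.6 range.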