Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu
TL;DR
The paper tackles the problem of aligning superhuman AI when direct human supervision becomes infeasible, proposing recursive self-critiquing as a scalable oversight paradigm. It hypothesizes that critique of critique can be easier than critique itself, and that this difficulty relationship holds recursively, so that multi-level meta-evaluations can be used to supervise AI outputs. Through Human-Human, Human-AI, and AI-AI experiments across diverse tasks, the authors show that higher-order critiques can improve accuracy, confidence, and efficiency relative to direct evaluation or simple voting baselines, though effectiveness can depend on relative model capabilities. The work suggests a feasible pathway toward scalable oversight as AI systems continue to surpass human abilities, while acknowledging current limitations and the need for further improvements in automatic critique capabilities and supervision reliability.
Abstract
As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) \textit{Critique of critique can be easier than critique itself}, extending the widely accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) \textit{This difficulty relationship holds recursively}, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We conduct Human-Human, Human-AI, and AI-AI experiments to investigate the potential of recursive self-critiquing for AI supervision. Our results highlight recursive critique as a promising approach for scalable AI oversight.
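To make the recursive structure in hypothesis (2) concrete, the following is a minimal sketch of how a chain of higher-order critiques could be assembled. It is an illustration only and not the authors' experimental protocol: the names `Critic`, `recursive_critique`, `describe_order`, and `toy_critic` are hypothetical, and the toy critic stands in for a human or model reviewer.

```python
from typing import Callable, List

# Hypothetical interface (an assumption, not from the paper): a critic sees the
# task and the chain so far (response, critique, critique of critique, ...)
# and returns a critique of the last item in that chain.
Critic = Callable[[str, List[str]], str]


def recursive_critique(task: str, response: str, critic: Critic, depth: int) -> List[str]:
    """Return [response, C1, ..., C_depth], where each C_k critiques C_(k-1)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    chain = [response]
    for _ in range(depth):
        # Each new level only has to judge the previous level's judgment.
        chain.append(critic(task, chain))
    return chain


def describe_order(order: int) -> str:
    """Name an item by its position in the chain (0 is the original response)."""
    if order < 0:
        raise ValueError("order must be non-negative")
    if order == 0:
        return "response"
    return " of ".join(["critique"] * order)


def toy_critic(task: str, chain: List[str]) -> str:
    """Placeholder reviewer that just labels what it was asked to review."""
    return f"review of the {describe_order(len(chain) - 1)}"


if __name__ == "__main__":
    chain = recursive_critique("Is 17 prime?", "Yes, 17 is prime.", toy_critic, depth=3)
    for order, text in enumerate(chain):
        print(f"{describe_order(order)}: {text}")
```

In this sketch a supervisor would read only the last item in the chain, which reflects the hypothesis that judging a higher-order critique may be more tractable than evaluating the original response directly.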
