Superalignment with Dynamic Human Values
Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek
TL;DR
The paper addresses the challenge of aligning superhuman AI with changing human values while keeping oversight scalable. It sketches a framework that decomposes hard tasks into subtasks solvable by a human-level AI proxy; the subtask solutions are then recomposed and checked by a verifier to produce an aligned complete solution. The core idea is the part-to-complete generalization hypothesis, which posits that alignment of subtask solutions generalizes to alignment of the complete solution. The paper argues that this property should be measured and proposes directions for improving it. If validated, the approach could enable dynamic, safe alignment for high-stakes AI systems by keeping humans in the loop through structured task decomposition.
Abstract
Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We argue that this generalization needs to be measured and propose ways to improve it in the future.
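To make the decompose, solve, recompose, and verify loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Pipeline`, `decompose`, `solve_subtask`, `recompose`, `verify`, and `part_to_complete_rate` are all hypothetical, and the metric shown is only one possible way to quantify part-to-complete generalization.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pipeline:
    """Hypothetical wiring of the four roles described in the abstract."""
    decompose: Callable[[str], List[str]]        # superhuman model: task -> subtasks
    solve_subtask: Callable[[str], str]          # human-level aligned proxy
    recompose: Callable[[str, List[str]], str]   # combine partial solutions
    verify: Callable[[str, str], bool]           # check the complete solution

    def run(self, task: str) -> Tuple[str, bool]:
        subtasks = self.decompose(task)
        partials = [self.solve_subtask(s) for s in subtasks]
        solution = self.recompose(task, partials)
        return solution, self.verify(task, solution)


def part_to_complete_rate(
    subtask_aligned: List[List[bool]],
    complete_aligned: List[bool],
) -> float:
    """Assumed metric: among tasks whose subtask solutions are ALL judged
    aligned, the fraction whose complete solution is also judged aligned."""
    eligible = [
        complete
        for parts, complete in zip(subtask_aligned, complete_aligned)
        if all(parts)
    ]
    if not eligible:
        raise ValueError("no task has all subtask solutions aligned")
    return sum(eligible) / len(eligible)
```

In this sketch, a rate near 1.0 would be consistent with the hypothesis on the evaluated tasks, while a low rate would indicate that aligned parts do not reliably compose into aligned wholes. How alignment judgments for parts and wholes are obtained is left open here.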
