Superalignment with Dynamic Human Values

Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek

TL;DR

The paper addresses the challenge of aligning superhuman AI with changing human values while ensuring scalable oversight. It introduces a framework that decomposes hard tasks into subtasks solvable by a human-level AI proxy, with recomposition and a verifier to produce aligned complete solutions. The core idea is the part-to-complete generalization hypothesis, which posits that alignment of subtasks generalizes to the full task, and it outlines how to measure and improve this property. If validated, the approach could enable dynamic, safe alignment for high-stakes AI systems by keeping humans in the loop through structured task decomposition.

Abstract

Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future.

Paper Structure

This paper contains 6 sections, 2 figures.

Figures (2)

  • Figure 1: Example of part-to-complete generalization in the dinner table reservation task, in which an AI agent is tasked with booking a restaurant that satisfies the preferences of all attendees. Partial solutions to subtasks are assumed to be well-aligned in isolation. However, the alignment of the complete solution depends on how the partial solutions are recomposed. In the aligned composition, the AI agent first identifies the overlap in preferences and then books a single restaurant. In the unsafe composition, tables are booked individually before an overlap is identified, leading to many unnecessary reservations. The section on improving part-to-complete generalization discusses strategies to steer the model toward aligned compositions.
  • Figure 2: Our proposed approach (see the framework section) for maintaining human oversight in superalignment through part-to-complete generalization. (a) On a regular basis, a human-level AI $H_{\phi}$ is aligned to humans $H$ on human-level tasks $\mathcal{E}$ to account for the dynamic nature of human values. After adapting the human-level AI, we train the superhuman planner model $P_{\theta}$ on superhuman tasks $\mathcal{D}$. (b) The reasoning model $P_{\theta}$ decomposes each task $X$ into simpler subtasks. Each subtask is solved and judged by the human-level aligned AI $H_{\phi}$. The reasoning model then recomposes the partial solutions into a complete solution, which is verified for correctness by a rule-based verifier $V$. The reasoning model is then updated using a reinforcement learning algorithm RLFT (e.g., PPO; Schulman et al., 2017) based on the correctness reward $R$ and the partial alignment rewards. Under the part-to-complete generalization hypothesis, we expect the alignment of solutions to subtasks to generalize to the complete solution.
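
The contrast in Figure 1 can be made concrete with a small sketch. The code below is an illustration only, not the paper's implementation: the attendee names, restaurant names, and all function names (`acceptable_restaurants`, `find_overlap`, `aligned_composition`, `unsafe_composition`) are hypothetical. Both compositions reuse the same two subtask solvers, which are assumed to be well-aligned in isolation; only the order in which they are recomposed differs.

```python
from typing import Dict, List, Set

# Hypothetical toy data: the restaurants each attendee would accept.
PREFERENCES: Dict[str, Set[str]] = {
    "ana": {"trattoria", "sushi_bar", "bistro"},
    "ben": {"sushi_bar", "bistro"},
    "cy": {"bistro", "taqueria"},
}


def acceptable_restaurants(attendee: str) -> Set[str]:
    """Subtask 1: look up one attendee's acceptable restaurants."""
    return set(PREFERENCES[attendee])


def find_overlap(option_sets: List[Set[str]]) -> Set[str]:
    """Subtask 2: restaurants acceptable to every attendee."""
    return set.intersection(*option_sets) if option_sets else set()


def aligned_composition(attendees: List[str]) -> List[str]:
    """Identify the overlap first, then book a single restaurant."""
    overlap = find_overlap([acceptable_restaurants(a) for a in attendees])
    # min() only makes the choice deterministic for this sketch.
    return [min(overlap)] if overlap else []


def unsafe_composition(attendees: List[str]) -> List[str]:
    """Book each attendee's options individually; the overlap is only
    identified afterwards, so the extra bookings have already been made."""
    booked: Set[str] = set()
    for attendee in attendees:
        booked |= acceptable_restaurants(attendee)
    return sorted(booked)


def unnecessary_reservations(bookings: List[str], attendees: List[str]) -> int:
    """Count bookings beyond the single one the task requires."""
    overlap = find_overlap([acceptable_restaurants(a) for a in attendees])
    needed = 1 if overlap & set(bookings) else 0
    return len(bookings) - needed


if __name__ == "__main__":
    guests = list(PREFERENCES)
    for compose in (aligned_composition, unsafe_composition):
        bookings = compose(guests)
        extra = unnecessary_reservations(bookings, guests)
        print(compose.__name__, bookings, "unnecessary:", extra)
```

In this toy setting every subtask call is harmless on its own, yet the unsafe ordering leaves reservations that no one needs. Counting such side effects is one simple, assumed way to quantify how far a composition departs from the aligned one; the paper itself does not prescribe this metric.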