Steering LLMs via Scalable Interactive Oversight

Enyu Zhou; Zhiheng Xi; Long Ma; Zhihao Zhang; Shihan Dou; Zhikai Lei; Guoteng Wang; Rui Zheng; Hang Yan; Tao Gui; Qi Zhang; Xuanjing Huang

TL;DR

The paper tackles the challenge of aligning powerful LLMs with imperfect human intent in long-horizon tasks by introducing Scalable Interactive Oversight, a recursive, tree-structured interaction framework that elicits low-burden feedback at leaf nodes and aggregates it into global guidance before execution. It validates the approach on a vibe-coding task—web development PRD generation—demonstrating up to 54% improvement in alignment over baselines and enabling online RL from human feedback to further improve performance and efficiency. The Sandwich Protocol underpins the evaluation, using a non-expert supervisor, a capable model, and an expert evaluator to bound achievable alignment and guide methodological design. The work also shows that reinforcement learning with online human feedback, optionally combined with expert rewards, generalizes to untrained modules and accelerates interactive efficiency, offering a practical pathway for maintaining human control as AI scales. Overall, the framework advances controllability in AI through structured, scalable human supervision that preemptively translates vague intent into precise, verifiable specifications.

Abstract

As Large Language Models increasingly automate complex, long-horizon tasks such as vibe coding, a supervision gap has emerged. While models excel at execution, users often struggle to guide them effectively due to insufficient domain expertise, the difficulty of articulating precise intent, and the inability to reliably validate complex outputs. This presents a critical challenge in scalable oversight: enabling humans to responsibly steer AI systems on tasks that surpass their own ability to specify or verify. To tackle this, we propose Scalable Interactive Oversight, a framework that decomposes complex intent into a recursive tree of manageable decisions to amplify human supervision. Rather than relying on open-ended prompting, our system elicits low-burden feedback at each node and recursively aggregates these signals into precise global guidance. Validated on a web development task, our framework enables non-experts to produce expert-level Product Requirement Documents, achieving a 54% improvement in alignment. Crucially, we demonstrate that this framework can be optimized via Reinforcement Learning using only online user feedback, offering a practical pathway for maintaining human control as AI scales.
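The tree-structured elicitation and aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `collect_feedback` and `aggregate` helpers, and the toy PRD questions are all hypothetical, feedback is assumed to be collected only at leaf nodes, and the step in which the tree is revised after each interaction is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One decision in the task tree (hypothetical structure)."""
    question: str
    children: List["Node"] = field(default_factory=list)
    feedback: str = ""  # filled in at leaf nodes by the user


def collect_feedback(node: Node, ask: Callable[[str], str]) -> None:
    """Depth-first walk that queries the user only at leaf nodes."""
    if not node.children:
        node.feedback = ask(node.question)
        return
    for child in node.children:
        collect_feedback(child, ask)


def aggregate(node: Node) -> List[str]:
    """Recursively merge leaf feedback into a flat list of guidance lines."""
    if not node.children:
        return [f"{node.question}: {node.feedback}"]
    lines: List[str] = []
    for child in node.children:
        lines.extend(aggregate(child))
    return lines


# Toy tree with made-up questions, standing in for a decomposed PRD task.
root = Node("PRD", [
    Node("Product overview", [Node("Target audience?")]),
    Node("Core function", [Node("Need user login?"), Node("Need search?")]),
])

# Canned answers stand in for a real (or simulated) user.
canned = {
    "Target audience?": "students",
    "Need user login?": "yes",
    "Need search?": "no",
}
collect_feedback(root, lambda q: canned[q])
guidance = aggregate(root)
print("\n".join(guidance))
```

In this sketch each question is small enough for a non-expert to answer, and the aggregated `guidance` list plays the role of the global specification handed to the model before execution.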

Paper Structure (58 sections, 4 equations, 13 figures, 4 tables, 1 algorithm)

Introduction
Preliminary & Problem Setup
Preliminary: The "Sandwich" Protocol
Problem Setup
Method: Scalable Interactive Oversight
Decomposition initializing:
Interacting at node-level:
Updating the task-decomposition:
This design adopts three mechanisms for scalable oversight:
Empirical Validation of Scalable Interactive Oversight Framework at Test Time
Setup
Evaluation settings.
User simulation.
Baselines.
Results
...and 43 more sections

Figures (13)

Figure 1: Motivation: As AI increasingly surpasses humans at solving complex problems, people often delegate tasks such as software development to AI using only natural language instructions. Misalignment arises in such collaboration because humans become weak supervisors: they struggle to give feedback on large outputs and challenging tasks. Framework: We decompose the task into a structured tree $\mathcal{T}^t$. After the interaction at node $v^t$, the accumulated user preference is used to update $\mathcal{T}^t$ to $\mathcal{T}^{t+1}$, so that subsequent interactions are better aligned with the user. The system loops until all nodes are completed.
Figure 2: Results of test-time experiments. "Model" refers to the doc generator, i.e., the model to be aligned. Module1-Module5 are the PRD modules: product overview, core function, non-functional requirements, business rules, and user experience design. Best results are bolded.
Figure 2: Test results for the RL model. For the left part, we use the same setting at test time as in training (i.e., gemini-2.5-pro as doc generator, o4-mini as tree updator). We also use GPT-5 as both the tree updator and the doc generator at test time to check whether the model generalizes to unseen settings. M1-M5 are the five parts of the PRD, as in the main results table, where M3–M5 are not included during training (marked with $\dagger$).
Figure 3: Alignment score evolution over interaction. Scores are measured from intermediate documents generated with cumulative preferences with GPT-5 as interaction agents (Left: simulated user; Right: human user).
Figure 4: Results of the ablation study. We test on the first 2 modules with GPT-5 as the interaction model.
...and 8 more figures
