Accelerating Robotic Reinforcement Learning with Agent Guidance
Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang
TL;DR
AGPS replaces 1:1 human supervision with a multimodal agent that acts as a semantic world model, supplying priors that structure exploration. It pairs an asynchronous FLOAT trigger with a groundable toolbox to provide Action Guidance and Exploration Pruning, using a trajectory-deviation metric $d_{OT}$ and the FLOAT index $\lambda(\mathcal{T}_b)$ to decide when guidance is needed. A Perception Module, an Action Primitives Library, and a Memory Module ground high-level semantics into precise geometric constraints, enabling robust manipulation of both rigid and deformable objects. On USB Insertion and Chinese Knot Hanging, AGPS learns with zero human intervention and improved sample efficiency. Overall, it demonstrates a scalable path to labor-free real-world robotic learning by leveraging semantic priors to structure exploration and recover from failures.
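The trigger logic summarized above can be illustrated with a minimal sketch. This section defines neither $d_{OT}$ nor $\lambda(\mathcal{T}_b)$, so everything below is an assumption made for illustration: $d_{OT}$ is taken to be an entropic-regularized optimal-transport cost between a rollout and a reference trajectory, $\lambda(\mathcal{T}_b)$ is taken to be the mean of that cost over a batch $\mathcal{T}_b$, and the names `ot_deviation`, `float_index`, `needs_guidance`, and the threshold value are hypothetical. The asynchronous aspect of the trigger is omitted.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ot_deviation(traj, ref, eps=0.05, iters=200):
    """Assumed stand-in for d_OT: Sinkhorn (entropic OT) cost between two
    point sets with uniform weights. Costs much larger than eps can
    underflow; a log-domain variant would be more robust."""
    n, m = len(traj), len(ref)
    cost = [[sq_dist(x, y) for y in ref] for x in traj]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


def float_index(batch, ref):
    """Hypothetical lambda(T_b): mean deviation of a batch from the reference."""
    return sum(ot_deviation(t, ref) for t in batch) / len(batch)


def needs_guidance(batch, ref, threshold=0.5):
    """Request agent guidance when the batch deviates beyond the threshold."""
    return float_index(batch, ref) > threshold


reference = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
on_track = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
drifted = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
print(needs_guidance([on_track], reference))
print(needs_guidance([drifted], reference))
```

In this toy setup the on-track batch stays below the threshold and the drifted batch exceeds it, so only the latter would request guidance. Aggregating over a batch rather than a single rollout is one plausible way to keep the trigger from firing on a single noisy episode.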
Abstract
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial and error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. Using executable tools, the agent provides precise guidance in two forms: corrective waypoints, and spatial constraints that prune exploration. We validate our approach on two tasks, one requiring precision insertion and the other deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. By automating the supervision pipeline, AGPS opens a path to labor-free and scalable robot learning. Project website: https://agps-rl.github.io/agps.
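To make the idea of spatial constraints for exploration pruning concrete, the following is a minimal sketch under stated assumptions, not the method used in the paper. It assumes the constraint is an axis-aligned box in the end-effector workspace and that the policy outputs Cartesian position deltas; the `Box` class and the `prune_action` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned workspace constraint (lower and upper corners)."""
    lo: Vec3
    hi: Vec3

    def contains(self, p: Vec3) -> bool:
        return all(l <= x <= h for x, l, h in zip(p, self.lo, self.hi))

    def project(self, p: Vec3) -> Vec3:
        # Clamp each coordinate into [lo, hi]: the nearest point inside the box.
        return tuple(min(max(x, l), h) for x, l, h in zip(p, self.lo, self.hi))


def prune_action(pos: Vec3, delta: Vec3, box: Box) -> Vec3:
    """Shrink a position delta so the commanded target stays inside the box."""
    target = tuple(p + d for p, d in zip(pos, delta))
    clamped = box.project(target)
    return tuple(c - p for c, p in zip(clamped, pos))


workspace = Box(lo=(-0.1, -0.1, 0.0), hi=(0.1, 0.1, 0.2))
print(prune_action((0.0, 0.0, 0.1), (0.2, 0.0, 0.0), workspace))
```

Here a step that would leave the box along x is shortened so the target lands on the boundary, while steps that stay inside the box pass through unchanged. A corrective waypoint could be handled in the same spirit by replacing the policy's delta with the offset from the current position to the waypoint.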
