
Accelerating Robotic Reinforcement Learning with Agent Guidance

Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang

TL;DR

AGPS replaces 1:1 human supervision with a multimodal agent that acts as a semantic world model, supplying semantic priors to structure exploration. An asynchronous FLOAT trigger decides when guidance is needed, using a trajectory-deviation metric $d_{OT}$ and the FLOAT index $\lambda(\mathcal{T}_b)$; when it fires, the agent draws on a grounding toolbox to provide Action Guidance and Exploration Pruning. The approach grounds high-level semantics into precise geometric constraints via a Perception Module, an Action Primitives Library, and a Memory Module, which supports manipulation of both rigid and deformable objects. On USB Insertion and Chinese Knot Hanging, AGPS learns with zero human intervention and improved sample efficiency, pointing to a scalable path toward labor-free real-world robotic learning.

Abstract

Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: https://agps-rl.github.io/agps.
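
The two forms of guidance named in the abstract, corrective waypoints and spatial constraints for exploration pruning, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Box` class, the `guided_target` function, the `max_step` parameter, and the choice of an axis-aligned box are all hypothetical, introduced only to show one way a waypoint and a spatial constraint could shape the next end-effector target.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Axis-aligned task-relevant volume (hypothetical representation)."""
    low: np.ndarray
    high: np.ndarray

    def clip(self, position):
        # Project a position back into the box, coordinate by coordinate.
        return np.clip(position, self.low, self.high)


def guided_target(current, policy_delta, box, waypoint=None, max_step=0.01):
    """Return the next end-effector target position.

    With a waypoint, take a bounded step toward it (action guidance);
    otherwise apply the policy's own displacement. In both cases the
    result is clipped into the box (exploration pruning).
    """
    current = np.asarray(current, dtype=float)
    if waypoint is not None:
        direction = np.asarray(waypoint, dtype=float) - current
        dist = np.linalg.norm(direction)
        # Move straight to the waypoint if it is within one step.
        step = direction if dist <= max_step else direction / dist * max_step
    else:
        step = np.asarray(policy_delta, dtype=float)
    return box.clip(current + step)


if __name__ == "__main__":
    box = Box(low=np.zeros(3), high=np.full(3, 0.1))
    start = [0.09, 0.05, 0.05]
    # The policy tries to leave the box; the target is clipped to its face.
    print(guided_target(start, [0.05, 0.0, 0.0], box))
    # A corrective waypoint overrides the policy's displacement.
    print(guided_target(start, [0.05, 0.0, 0.0], box, waypoint=[0.05, 0.05, 0.05]))
```

Clipping keeps the policy in control of its action inside the volume and only intervenes at the boundary, which is one simple way to prune exploration without replacing the learned behavior.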

Paper Structure

This paper contains 29 sections, 3 equations, 7 figures, 2 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Overview. Left: HIL methods encounter a scalability barrier as task complexity rises, restricted by the 1:1 supervision ratio and operator fatigue. Right: AGPS transcends this barrier by automating supervision. The system employs FLOAT as an asynchronous trigger to monitor policy performance. When a deviation is detected, the agent recalls memory and leverages a toolbox (Action Primitives, Perception, Geometry) for spatial reasoning. These interventions manifest as Action Guidance for trajectory correction and Exploration Pruning for spatial constraining.
  • Figure 2: Overall performance comparisons. The curves illustrate the evolution of success rate with respect to training steps and wall-clock time. We evaluate the policy at 4 distinct checkpoints, with success rates calculated over 10 evaluation episodes per checkpoint.
  • Figure 3: Comparison of USB insertion tasks. (a) illustrates a failure case; (b) shows the AGPS spatial reasoning where red points denote semantic keypoints and the green box represents the task-relevant volume for exploration pruning.
  • Figure 4: Visualization of policy training dynamics. The heatmaps visualize the value distribution across the Y-Z spatial coordinates at different training steps (400 to 1200). (Bottom) HIL-SERL learns a narrow high-value corridor. This indicates overfitting to specific human demonstrations, leaving surrounding states with near-zero value (blue regions). (Top) AGPS develops a broad high-value funnel. This shows the policy has learned recovery behaviors for states deviating from the optimal trajectory, enabling it to handle misalignments.
  • Figure 5: Ablation and intervention analysis. (Left) The memory module (red) accelerates convergence by $2\times$ compared to the baseline (blue). (Right) The number of triggers per 10 rollouts decreases over time. Both tasks ultimately reach zero interventions.
  • ...and 2 more figures
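
The captions for Figures 1 and 5 describe a trigger that monitors the policy and fires when a deviation is detected, with the number of triggers falling as training proceeds. The page does not define the FLOAT index or the deviation metric, so the sketch below does not reproduce them. It is a hypothetical stand-in: a thresholded moving average over a per-step deviation score, where the class name, `threshold`, and `window` are all assumptions. It is also synchronous, unlike the asynchronous trigger the paper describes.

```python
from collections import deque


class DeviationTrigger:
    """Hypothetical trigger: fires when the mean of the last `window`
    deviation scores exceeds `threshold`. Not the paper's FLOAT rule."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.trigger_count = 0  # running count of fired triggers

    def update(self, score):
        """Record one deviation score; return True if guidance is requested."""
        self.scores.append(float(score))
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        fired = window_full and mean > self.threshold
        if fired:
            self.trigger_count += 1
            # Start a fresh window so one deviation is not counted repeatedly.
            self.scores.clear()
        return fired


if __name__ == "__main__":
    trigger = DeviationTrigger(threshold=1.0, window=3)
    for s in [0.5, 0.5, 0.5, 3.0, 3.0]:
        print(s, trigger.update(s))
    print("triggers:", trigger.trigger_count)
```

Counting fired triggers over a fixed number of rollouts gives the kind of intervention statistic plotted in Figure 5 (Right), though the actual quantity there depends on the paper's own trigger definition.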