Human-in-the-loop Online Rejection Sampling for Robotic Manipulation

Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang

TL;DR

This paper addresses the instability of post-training vision-language-action policies with reinforcement learning in real-world robotic manipulation, which is caused by inaccurate value estimates and sparse intermediate supervision. It introduces Hi-ORS, a simple post-training method that uses outcome-based rejection sampling to stabilize training and a reward-weighted supervised objective to provide dense supervision across intermediate inference steps, together with an asynchronous human-in-the-loop framework for explicit error-recovery demonstrations. The approach is validated on three real-world tasks across two embodiments, delivering better performance and sample efficiency (about 1.5 hours of training) than RL and IL baselines, and demonstrating test-time scalability through learned error-recovery behaviors. The results position Hi-ORS as a practical, robust baseline for fine-tuning vision-language-action policies in real-world robotic manipulation, with potential extensions to multi-task and longer-horizon settings.

Abstract

Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
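
The two ingredients named in the abstract, filtering by episode outcome and a reward-weighted supervised loss, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Episode` container, the `reject_failures` and `reward_weighted_loss` helpers, the zero reward threshold, and the use of a plain mean-squared error in place of the flow-matching objective are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """Hypothetical rollout record: logged actions plus one outcome reward."""
    actions: List[List[float]]  # one action vector per step (policy or human correction)
    reward: float               # episode-level outcome, e.g. 1.0 success, 0.0 failure


def reject_failures(episodes: List[Episode], threshold: float = 0.0) -> List[Episode]:
    """Outcome-based rejection: keep episodes whose reward exceeds the threshold."""
    return [ep for ep in episodes if ep.reward > threshold]


def reward_weighted_loss(
    episodes: List[Episode],
    predict: Callable[[Episode, int], List[float]],
) -> float:
    """Reward-weighted mean-squared error over every step of every kept episode.

    MSE is a stand-in here; a flow-matching policy would use its own
    per-step regression target instead.
    """
    total, weight = 0.0, 0.0
    for ep in episodes:
        for t, target in enumerate(ep.actions):
            pred = predict(ep, t)
            err = sum((p - a) ** 2 for p, a in zip(pred, target)) / len(target)
            total += ep.reward * err   # each step inherits the episode's reward
            weight += ep.reward
    return total / weight if weight > 0 else 0.0


if __name__ == "__main__":
    rollouts = [
        Episode(actions=[[0.0, 1.0]], reward=1.0),  # successful rollout
        Episode(actions=[[5.0, 5.0]], reward=0.0),  # failed rollout, rejected
    ]
    kept = reject_failures(rollouts)
    loss = reward_weighted_loss(kept, lambda ep, t: [0.0, 0.0])
    print(len(kept), loss)
```

Because every step of an accepted episode contributes to the loss, supervision is dense even though the reward is only observed at the end of the episode, and no learned value function is needed.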

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Hi-ORS is a simple post-training method that stabilizes real-world RL. It replaces inaccurate value networks (e.g., in action chunking) with outcome-based rejection sampling, and implements a reward-weighted supervised training objective to distill dense intermediate-step supervision in VLAs (e.g., flow-matching-based). Hi-ORS also incorporates online human-in-the-loop corrections as explicit guidance for learning error-recovery behaviors.
  • Figure 2: The overall pipeline of Hi-ORS, which consists of a rejection sampling framework, a supervised training objective, a varied frequency strategy, and an asynchronous infrastructure. Hi-ORS enables both training stability and high robustness in post-training VLAs for real-world robotic manipulation. Here we take a flow matching-based policy $\pi_0$ as an example.
  • Figure 3: Real-world Settings. We design three real-world tasks of varying difficulty across two embodiments to systematically evaluate the proposed method.
  • Figure 4: Real-world Results. We report the evaluation success-rate curves of different methods on three real-world robotic manipulation tasks with different embodiments.
  • Figure 5: Test-time Scaling in Insert-Moisturizer. We show that larger trial budgets at evaluation yield higher test performance, a potential signal of test-time scaling in robotic manipulation.
  • ...and 2 more figures