A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, Jiangmiao Pang

TL;DR

VLAC tackles sparse rewards in real-world robotic RL by learning dense, pairwise progress signals and generating actions within the same autoregressive model. It integrates multimodal perception with language prompts, enabling zero-shot and in-context transfer across tasks, and relies on a low-latency, asynchronous infrastructure and human-in-the-loop interventions to stabilize learning. The approach raises real-world manipulation success rates from about 30% to about 90% within 200 episodes and generalizes across rooms, tasks, and even multiple robots, with improved sample efficiency when guided by human input. This work provides a practical pathway for data-efficient, scalable, multimodal RL in real-world robotic systems, reducing reliance on handcrafted rewards and extensive task-specific engineering.

Abstract

Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogic, and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation. It is further trained on large numbers of constructed negative and semantically mismatched samples so that it rejects irrelevant prompts and detects regression or stagnation. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. We deploy VLAC inside an asynchronous real-world RL loop and layer a graded human-in-the-loop protocol (offline demonstration replay, return and explore, human guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
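
To make the "pairwise observations in, progress delta and done signal out" interface concrete, the following is a minimal sketch of how such a critic could be turned into per-step rewards. It is not the authors' implementation: the `Critic` signature, the `dense_rewards` helper, the `done_bonus` term, and the toy critic (which treats each observation as a scalar completion level) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical critic interface (an assumption, not the VLAC API):
# (previous observation, next observation, language goal) -> (progress delta, done flag)
Critic = Callable[[float, float, str], Tuple[float, bool]]


@dataclass
class Step:
    reward: float
    done: bool


def dense_rewards(
    observations: Sequence[float],
    goal: str,
    critic: Critic,
    done_bonus: float = 1.0,  # assumed terminal bonus; not taken from the paper
) -> List[Step]:
    """Score each consecutive observation pair and stop once the critic reports done."""
    steps: List[Step] = []
    for prev_obs, next_obs in zip(observations, observations[1:]):
        delta, done = critic(prev_obs, next_obs, goal)
        reward = delta + (done_bonus if done else 0.0)
        steps.append(Step(reward=reward, done=done))
        if done:
            break
    return steps


def toy_critic(prev_obs: float, next_obs: float, goal: str) -> Tuple[float, bool]:
    """Stand-in critic: observations are completion levels in [0, 1]."""
    return next_obs - prev_obs, next_obs >= 1.0


if __name__ == "__main__":
    # The second transition moves backwards, so its reward is negative (regression).
    for step in dense_rewards([0.0, 0.25, 0.125, 1.0], "put the bowl on the plate", toy_critic):
        print(step)
```

In this sketch every transition receives a signal, including a negative one when the observation pair shows regression, which is the property that distinguishes a dense progress reward from a sparse success flag.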

Paper Structure

This paper contains 25 sections, 11 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview. Pretrained on multi-source data, VLAC provides dense progress rewards for real-world RL while also serving as a policy to output actions, integrating into real-world RL loops to enable self-improvement in manipulation.
  • Figure 2: The VLAC model is trained on a combination of comprehensive public robotic manipulation datasets, human demonstration data, self-collected manipulation data, and various image understanding datasets. Video data is processed into pair-wise samples to learn the difference in task progress between two frames, supplemented with task descriptions and task completion evaluation to enable task progress understanding and action generation, as illustrated in the bottom-left corner. As shown in the diagram on the right, the model generalizes to new robots, scenarios, and tasks not covered in the training dataset. It can predict task progress and distinguish failed actions or trajectories, providing dense reward feedback for real-world RL and offering guidance for data refinement. Additionally, the model can directly perform manipulation tasks, exhibiting zero-shot capabilities to handle different scenarios.
  • Figure 3: The VLAC forward pass generates structured action tokens and reward tokens; a value head is attached to estimate state value for PPO updates.
  • Figure 4: VOC-F1 Performance Comparison Across Different Models and Datasets.
  • Figure 5: Example results of VLAC for task progress understanding.
  • ...and 5 more figures
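
The Figure 2 caption describes processing video into pair-wise samples that capture the difference in task progress between two frames. One way this could look is sketched below. The labeling rule (normalized frame-index difference), the `PairSample` record, and the `make_pairs` helper are assumptions for illustration; the source does not specify how the labels are computed.

```python
import random
from typing import List, NamedTuple


class PairSample(NamedTuple):
    first: int             # index of the first frame in the video
    second: int            # index of the second frame in the video
    task: str              # language description of the task
    progress_delta: float  # assumed label: normalized index difference in [-1, 1]


def make_pairs(num_frames: int, task: str, num_pairs: int, rng: random.Random) -> List[PairSample]:
    """Sample frame pairs from one successful video and label them by relative progress.

    Assumption: progress grows linearly with the frame index, so a pair in
    reversed order gets a negative label (regression) and a pair with the
    same frame twice gets zero (stagnation).
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    samples: List[PairSample] = []
    for _ in range(num_pairs):
        i = rng.randrange(num_frames)
        j = rng.randrange(num_frames)
        delta = (j - i) / (num_frames - 1)
        samples.append(PairSample(i, j, task, delta))
    return samples


if __name__ == "__main__":
    for sample in make_pairs(11, "stack the blocks", 5, random.Random(0)):
        print(sample)
```

Sampling both forward and reversed pairs from the same video gives positive and negative labels without extra annotation, which fits the abstract's mention of constructed negative samples, though the actual construction used for VLAC may differ.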