Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zang, Weinan Zhang, Chao Yu, Yu Wang

TL;DR

This paper proposes an RL-based sim-real co-training framework that leverages interactive simulation while preserving real-world capabilities, providing a practical and scalable pathway for using simulation to enhance real-robot deployment.

Abstract

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $\pi_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $\pi_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
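
The second stage described above can be read as a single combined objective. The following is a hedged sketch, not the paper's own equation: the loss names are illustrative, and the weight is written as $\beta$ only because the Figure 4 caption uses that symbol for the regularization weight.

$$
\mathcal{L}_{\text{Stage II}}(\theta) = \mathcal{L}_{\text{RL}}^{\text{sim}}(\theta) + \beta \, \mathcal{L}_{\text{SFT}}^{\text{real}}(\theta)
$$

Under this reading, $\mathcal{L}_{\text{RL}}^{\text{sim}}$ is computed from closed-loop rollouts in simulation, while $\mathcal{L}_{\text{SFT}}^{\text{real}}$ is the supervised loss on real-world demonstrations that anchors the policy. Setting $\beta = 0$ would recover plain RL fine-tuning in simulation with no protection against forgetting.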

Paper Structure

This paper contains 36 sections, 8 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of training paradigms combining real-world and simulated data. VLA models are commonly trained via supervised fine-tuning (SFT) on real-world demonstrations, or via reinforcement learning (RL) in simulation followed by sim-to-real transfer. Other approaches adopt SFT-based sim-real co-training by mixing real and simulated demonstrations. In contrast, we propose an RL-based sim-real co-training (RL-Co) framework, which initializes the model with sim-real SFT and subsequently performs RL in simulation while using real-world SFT as a regularization signal.
  • Figure 2: Overview of the proposed two-stage sim-real co-training framework. We establish a digital-twin setup where $T_{\text{sim}}$ serves as a digital cousin to $T_{\text{real}}$ despite visual discrepancies. In Stage I, we initialize the VLA policy by supervising it on a mixture of real and simulated data (ratio $\alpha$). This rapidly injects real-world knowledge and prepares the policy for simulation interaction. In Stage II, we perform RL fine-tuning in the simulator to explore and improve performance, simultaneously employing a real-world SFT loss as a regularizer to prevent the forgetting of real-world behaviors.
  • Figure 3: Visualization of our tabletop manipulation tasks. The top row shows images captured by a third-person camera in the real-world setup, while the bottom row presents the corresponding simulated views. Both real and simulated images are sampled from the task execution.
  • Figure 4: Analysis of the co-training ratio ($\alpha$) and regularization weight ($\beta$). We vary the co-training ratio $\alpha$ and evaluate the resulting performance on the Pick and Place and Open Drawer tasks. In addition, we fix $\alpha=0.5$ for Pick and Place and $\alpha=0.95$ for Open Drawer, reporting RL co-training results under different regularization weights $\beta$. Performance is measured by success rate, with shaded regions indicating standard deviation.
  • Figure 5: Ablation study on simulation SFT initialization. We report the simulation success rate during RL training for models trained with and without simulation SFT initialization. Each RL training process is run with three independent random seeds, and results are presented as the mean success rate with shaded regions indicating the standard deviation.
  • ...and 7 more figures
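
The roles of the co-training ratio $\alpha$ and the regularization weight $\beta$ named in the captions for Figures 2 and 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `sample_mixed_batch` and `stage2_loss` are hypothetical, and the captions do not say which side of the mixture $\alpha$ weights, so the code assumes $\alpha$ is the probability of drawing a simulated demonstration.

```python
import random


def sample_mixed_batch(real_demos, sim_demos, alpha, batch_size, rng):
    """Stage I sketch: build one SFT batch from real and simulated demos.

    ASSUMPTION: alpha is the probability that an item comes from simulation;
    the source only describes alpha as a co-training ratio.
    """
    batch = []
    for _ in range(batch_size):
        # Pick the data source for this item, then draw one demo from it.
        pool = sim_demos if rng.random() < alpha else real_demos
        batch.append(rng.choice(pool))
    return batch


def stage2_loss(rl_loss_sim, sft_loss_real, beta):
    """Stage II sketch: simulation RL loss plus a beta-weighted real-data SFT loss."""
    return rl_loss_sim + beta * sft_loss_real


if __name__ == "__main__":
    rng = random.Random(0)
    real = ["real_demo_0", "real_demo_1"]  # placeholder demo identifiers
    sim = ["sim_demo_0", "sim_demo_1"]
    print(sample_mixed_batch(real, sim, alpha=0.5, batch_size=6, rng=rng))
    print(stage2_loss(rl_loss_sim=2.0, sft_loss_real=4.0, beta=0.5))
```

In this sketch the two loss values are plain floats standing in for whatever RL and SFT losses a real training loop would compute; only the way they are combined is shown.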