Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

Ruixing Jin, Zicheng Zhu, Ruixiang Ouyang, Sheng Xu, Bo Yue, Zhizheng Wu, Guiliang Liu

Abstract

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

Paper Structure (40 sections, 5 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of our framework for analyzing Sim2Real generalization in Vision-Language-Action (VLA) models. We study how different Sim2Real techniques, including domain randomization, rendering fidelity, and reinforcement learning fine-tuning, influence generalization across Vision, Semantics, and Execution under both simulation OOD and real-world evaluations.
  • Figure 2: Real-world manipulation setup and randomized factors used in our experiments. Left: the physical platform with robot arms, camera, lighting, objects, and distractors. Right: examples of variations including object positions, lighting, backgrounds, object instances, and distractor configurations.
  • Figure 3: Rendering fidelity analysis. (a) Quantitative results showing the impact of photorealism and physical realism on Sim-OOD and real-world success rates. (b) Example renderings under different photorealism levels (Low, Medium, High).
  • Figure 4: Model architecture of OpenVLA-OFT.
  • Figure 5: Pose design for real-world evaluation across five manipulation tasks. The Place Empty Cup task includes eight pose variations, while the other tasks use four pose variations each.
  • ...and 1 more figure