RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, Jinwoo Shin

TL;DR

RoboCurate tackles the data bottleneck in robot learning by generating diverse neural trajectories and validating their actions through simulator replay, addressing the limitations of video-level plausibility checks. The framework combines controllable visual and instruction diversification with an action-level filtering mechanism powered by an attentive motion-alignment probe, plus a Best-of-N sampling strategy to improve generated data quality. Empirical results across GR-1 Tabletop, DexMimicGen, and ALLEX demonstrate substantial improvements over real-data baselines and prior neural-trajectory methods, including strong out-of-distribution generalization. The work advances data-centric robot learning by aligning synthetic observations with physically grounded actions, facilitating reliable policy training and cross-embodiment transfer in both simulated and real-world settings.

Abstract

Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they are limited in their ability to distinguish physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe that RoboCurate's generated data yields substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
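
The action-level filtering and Best-of-N selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: `Candidate`, `best_of_n`, `score_fn`, and the `threshold` value are hypothetical names and parameters introduced here. `score_fn` stands in for the whole "replay the predicted actions in a simulator, then score how well the rollout motion matches the generated video" step, which in the paper is handled by a learned probe.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    """One generated video together with its annotated (predicted) actions."""
    video_id: str
    actions: Sequence[float]


def best_of_n(
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate], float],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> Optional[Candidate]:
    """Keep the highest-scoring candidate whose consistency score passes the threshold.

    score_fn is a placeholder for: replay candidate.actions in a simulator,
    then score the motion consistency between rollout and generated video.
    Returns None when every candidate is filtered out.
    """
    scored = [(score_fn(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]
```

With toy scores such as `{"a": 0.2, "b": 0.9, "c": 0.7}`, the sketch would keep candidate `b`; if no candidate reaches the threshold, the whole sample is discarded rather than passed on to policy training.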

Paper Structure

This paper contains 39 sections, 8 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Overview of RoboCurate. (1) We generate diverse neural trajectories by applying an image-to-image (I2I) model for scene diversity and a video-to-video (V2V) model for appearance diversity. (2) We then filter the neural trajectories using simulator-replay consistency, retaining only those for which a classifier predicts that the motion in the generated video matches the simulator rollout.
  • Figure 2: Examples of neural trajectories. (Top): original videos; (Bottom): visually augmented neural trajectories. The two bottom-left frames show a video whose initial frame is edited by the I2I model, while the two bottom-right frames show a video processed with V2V transfer.
  • Figure 3: Examples of negative pairs for attentive probe training. We construct negative pairs from a real-world dataset by inducing temporal shifts or sampling videos from different episodes.
  • Figure 4: An overview of the experimental design for RoboCurate. We conduct two-phase experiments: (1) pre-training on real data and neural trajectories followed by fine-tuning on simulation data, and (2) co-finetuning on real data and neural trajectories.
  • Figure 5: Visualization of benchmarks. We visualize our benchmark settings (from left to right): (1) GR-1 Tabletop [nvidia2025gr00tn1openfoundation], (2) DexMimicGen [jiang2025dexmimicgen] with bimanual Panda arms with dexterous hands, (3) DexMimicGen with the GR-1 humanoid, and (4) a real-robot benchmark on the dexterous-hand humanoid robot ALLEX.
  • ...and 1 more figures
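
The negative-pair construction summarized in the Figure 3 caption (temporal shifts and cross-episode sampling) might be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: `ClipPair`, `make_pairs`, `clip_len`, and `min_shift` are hypothetical, and pairing a video clip with a simulator-rollout clip by episode index and start frame is an assumption made here for concreteness.

```python
import random
from typing import List, NamedTuple, Sequence


class ClipPair(NamedTuple):
    video_episode: int
    video_start: int
    rollout_episode: int
    rollout_start: int
    label: int  # 1 = motion aligned (positive), 0 = mismatched (negative)


def make_pairs(
    episode_lengths: Sequence[int],
    clip_len: int,
    min_shift: int,
    rng: random.Random,
) -> List[ClipPair]:
    """Build one positive and up to two negatives per episode (hypothetical scheme)."""
    pairs: List[ClipPair] = []
    for ep, length in enumerate(episode_lengths):
        last_start = length - clip_len
        if last_start < 0:
            continue  # episode too short to cut a clip from
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        # Positive: same episode, same start frame.
        pairs.append(ClipPair(ep, start, ep, start, 1))
        # Negative by temporal shift: same episode, start offset by >= min_shift.
        shifted = [s for s in range(last_start + 1) if abs(s - start) >= min_shift]
        if shifted:
            pairs.append(ClipPair(ep, start, ep, rng.choice(shifted), 0))
        # Negative by cross-episode sampling: rollout clip from another episode.
        others = [o for o, n in enumerate(episode_lengths) if o != ep and n >= clip_len]
        if others:
            other = rng.choice(others)
            other_start = rng.randint(0, episode_lengths[other] - clip_len)
            pairs.append(ClipPair(ep, start, other, other_start, 0))
    return pairs
```

For example, `make_pairs([20, 30], clip_len=8, min_shift=4, rng=random.Random(0))` would return index tuples that a data loader could resolve into actual frame clips before feeding them to a binary classifier.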