Automated Creation of Digital Cousins for Robust Policy Learning

Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, Li Fei-Fei

TL;DR

Digital cousins offer a scalable alternative to digital twins by generating multiple virtual scenes that preserve high-level affordances from a single real image. The Automated Creation of Digital Cousins (ACDC) pipeline automatically extracts objects, matches them to a virtual asset library using DINOv2 and GPT, and composes fully interactive scenes for policy learning. Policies trained in these diverse digital cousins achieve comparable in-domain performance to twin-based policies while exhibiting greater robustness to unseen configurations and enabling zero-shot sim-to-real transfer. The work suggests a practical path toward scalable, robust robot learning without exhaustive real-world data collection or exact scene reconstruction.

Abstract

Training robot policies in the real world can be unsafe, costly, and difficult to scale. Simulation serves as an inexpensive and potentially limitless source of training data, but suffers from the semantics and physics disparity between simulated and real-world environments. These discrepancies can be minimized by training in digital twins, which serve as virtual replicas of a real scene but are expensive to generate and cannot produce cross-domain generalization. To address these limitations, we propose the concept of digital cousins, a virtual asset or scene that, unlike a digital twin, does not explicitly model a real-world counterpart but still exhibits similar geometric and semantic affordances. As a result, digital cousins simultaneously reduce the cost of generating an analogous virtual environment while also facilitating better robustness during sim-to-real domain transfer by providing a distribution of similar training scenes. Leveraging digital cousins, we introduce a novel method for their automated creation, and propose a fully automated real-to-sim-to-real pipeline for generating fully interactive scenes and training robot policies that can be deployed zero-shot in the original scene. We find that digital cousin scenes that preserve geometric and semantic affordances can be produced automatically, and can be used to train policies that outperform policies trained on digital twins, achieving 90% vs. 25% success rates under zero-shot sim-to-real transfer. Additional details are available at https://digital-cousins.github.io/.

Paper Structure

This paper contains 48 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overview. Fully interactive digital cousin scenes can be generated completely automatically from a single RGB image. Unlike a digital twin, digital cousins relax the assumption of completely reconstructing the minute details of a given scene and instead focus on preserving higher-level details, such as spatial relationships and semantic affordances. By leveraging motion planning and ground-truth simulation information, we can automatically collect demonstrations in our digital cousin scenes, augmented with physically plausible randomizations. A policy trained on these synthetic demonstrations can then be deployed zero-shot in the original scene, without requiring any additional finetuning.
  • Figure 2: ACDC Pipeline. ACDC is composed of three sequential steps. (1) First, relevant per-object information is extracted from the input RGB image. (2) Next, we use this information with an asset dataset to match digital cousins to each detected input object. (3) Finally, we post-process the chosen digital cousins and generate a fully-interactive simulated scene.
  • Figure 3: Qualitative real-to-sim digital cousin scene generation results. Multiple cousins are shown with a robot collecting demonstrations. Please refer to the supplementary section on real-to-sim scene generation for more results.
  • Figure 4: Sim-to-sim policy results. Aggregated success rates of policies trained on the exact twin, different numbers of cousins, and all assets in the three nearest categories. Policies are tested on four setups: the exact digital twin, and three increasingly dissimilar setups, as measured by DINOv2 embedding distance, to probe zero-shot generalization. Note that for Task 3, far fewer cabinet models make the task feasible, so we only compare the digital-twin and 8-cousin policies. Note also that the digital cousin training data does not include any of the evaluation instances. Additional information is available in the supplementary section on sim-to-sim policy learning.
  • Figure 5: Zero-shot real-world evaluation of digital cousin policy vs. digital twin baselines. Task is Door Opening on an IKEA cabinet. Metric is success rate: sim/real results averaged over 50/20 trials. Twin $+ \uparrow$DR is trained using increased domain (pose, scale) randomization, and Twin $+$ Cousin is trained on both twin and cousin data.
  • ...and 9 more figures
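
The captions above describe matching each detected object to digital cousins in an asset dataset and ranking dissimilarity by DINOv2 embedding distance. The following is a minimal sketch of that idea, assuming the embeddings have already been computed and that cosine distance is the metric; both are assumptions made for illustration, not details confirmed by the text. The function names `cosine_distances` and `select_cousins` and the toy vectors are hypothetical, and the GPT-based part of the matching step is not reproduced here.

```python
import numpy as np


def cosine_distances(query, candidates):
    """Cosine distance between one query vector and each row of candidates."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return 1.0 - c @ q


def select_cousins(query_emb, asset_embs, k):
    """Return indices of the k assets nearest to the query, nearest first.

    Hypothetical helper: the embeddings are assumed to be precomputed
    feature vectors (for example, from an image encoder such as DINOv2).
    """
    dists = cosine_distances(query_emb, asset_embs)
    order = np.argsort(dists, kind="stable")
    return order[:k].tolist()


# Toy 3-D vectors standing in for real embeddings (illustrative data only).
assets = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([1.0, 0.05, 0.0])

# Pick the two nearest assets as candidate cousins for the query object.
cousins = select_cousins(query, assets, k=2)
print(cousins)
```

Under these assumptions, raising `k` widens the set of cousins per object. This mirrors the comparison in Figure 4 between policies trained on different numbers of cousins.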