Automated Creation of Digital Cousins for Robust Policy Learning
Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
TL;DR
Digital cousins offer a scalable alternative to digital twins: from a single real image, multiple virtual scenes are generated that preserve the original scene's high-level affordances. The Automated Creation of Digital Cousins (ACDC) pipeline automatically extracts objects, matches them to a virtual asset library using DINOv2 and GPT, and composes fully interactive scenes for policy learning. Policies trained in these diverse digital cousins achieve in-domain performance comparable to that of twin-based policies, are more robust to unseen configurations, and transfer zero-shot from simulation to the real scene. The work suggests a practical path toward scalable, robust robot learning without exhaustive real-world data collection or exact scene reconstruction.
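To make the matching step above concrete, here is a minimal sketch of ranking library assets by feature similarity to a detected object. It is an illustration under stated assumptions, not the ACDC implementation: the three-dimensional vectors are toy stand-ins for image embeddings (for example, DINOv2 features), the names `cosine` and `nearest_assets` and the asset labels are hypothetical, and the GPT-based part of the matching is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_assets(query, library, k=3):
    """Return the names of the k library assets most similar to `query`.

    `library` maps a (hypothetical) asset name to its feature vector.
    """
    ranked = sorted(
        library.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy embeddings (assumed values, not real DINOv2 features).
library = {
    "cabinet_a": [0.9, 0.1, 0.0],
    "cabinet_b": [0.8, 0.2, 0.1],
    "fridge_a": [0.1, 0.9, 0.2],
    "chair_a": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]  # feature vector of an object cropped from the real image

cousins = nearest_assets(query, library, k=2)
print(cousins)
```

Keeping the top k matches rather than only the single best one is what yields a set of candidate assets per object, which is the property that distinguishes a distribution of cousins from one twin.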
Abstract
Training robot policies in the real world can be unsafe, costly, and difficult to scale. Simulation serves as an inexpensive and potentially limitless source of training data, but suffers from semantic and physical disparities between simulated and real-world environments. These discrepancies can be minimized by training in digital twins, which serve as virtual replicas of a real scene but are expensive to generate and do not provide cross-domain generalization. To address these limitations, we propose the concept of a digital cousin: a virtual asset or scene that, unlike a digital twin, does not explicitly model a real-world counterpart but still exhibits similar geometric and semantic affordances. As a result, digital cousins reduce the cost of generating an analogous virtual environment and, by providing a distribution of similar training scenes, improve robustness during sim-to-real domain transfer. We introduce a novel method for the automated creation of digital cousins, and propose a fully automated real-to-sim-to-real pipeline for generating fully interactive scenes and training robot policies that can be deployed zero-shot in the original scene. We find that digital cousin scenes that preserve geometric and semantic affordances can be produced automatically, and can be used to train policies that outperform policies trained on digital twins, achieving a 90% success rate versus 25% under zero-shot sim-to-real transfer. Additional details are available at https://digital-cousins.github.io/.
