D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Suhwan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee

TL;DR

This work tackles the data bottleneck in embodied AI by proposing D2E, which leverages abundant desktop interactions as a scalable pretraining substrate. It introduces the OWA Toolkit and OWAMcap to capture and compress diverse desktop data, a Generalist-IDM that generalizes across unseen games via timestamp-based next-event prediction, and VAPT to transfer desktop-learned priors to robotics. The approach yields strong results on LIBERO and CANVAS benchmarks, and demonstrates transfer to real-world robot manipulation and navigation, with notable gains from pseudo-labeling YouTube gameplay. The contributions collectively establish desktop data as a practical, scalable resource for embodied intelligence and provide open-source tools and datasets to empower further research.

Abstract

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve an overall success rate of 96.6% on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/

Paper Structure

This paper contains 70 sections, 10 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Overview of the D2E framework. (1) The OWA Toolkit captures 335.6 hours of rich desktop demonstrations across 31 games with 152× compression. (2) The Generalist-IDM uses next-event prediction with temporal offset (NEP-$\tau$) to achieve OOD generalization, enabling pseudo-labeling of 1K+ hours of YouTube gameplay. (3) Vision-Action Pretraining transfers desktop-pretrained representations to embodied AI, achieving 96.6% success on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark, which demonstrates desktop-to-robotics transfer.
  • Figure 2: OWA Toolkit's recording and storage architecture. (Left) The ocap recorder captures synchronized multimodal streams (video at 60 FPS, audio, mouse events, keyboard inputs, and window states) with precise time alignment, enabling accurate reconstruction of desktop interactions. (Right) The OWAMcap format uses a dual-layer architecture: a standardized MCAP container for crash-safe metadata and event logging, paired with external media referencing for optimized video storage using the H.265 codec (217× compression). This design reduces storage by 152× for the VPT dataset (1.06 TiB → 7.12 GiB) and by 34.45× for the CS:GO dataset (689 GiB → 20 GiB), while maintaining event fidelity and enabling efficient random access for training.
  • Figure 3: Our FSLDataset design, coupled with a batched decoding API, converts fine-grained random I/O into coarse, coalesced random access, thereby avoiding the limitations of large-scale filesystems that are inefficient for small random reads.
  • Figure 4: Trajectory of Battlefield 6.
  • Figure 5: Out-of-distribution performance on unseen 3D and 2D games. Note that Ogu Forest uses only keyboard inputs.
  • ...and 4 more figures
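
The storage figures quoted in the Figure 2 caption can be checked with a few lines of arithmetic. The sketch below is not part of the OWA Toolkit. The helper name `compression_ratio` is hypothetical, and it assumes that TiB and GiB are binary units (1 TiB = 1024 GiB) and that the quoted sizes are rounded as printed in the caption.

```python
# Sanity check of the compression ratios quoted in the Figure 2 caption.
# Assumption: binary units, so 1 TiB = 1024 GiB.
GIB_PER_TIB = 1024


def compression_ratio(raw_gib: float, compressed_gib: float) -> float:
    """Return how many times smaller the compressed data is than the raw data."""
    if compressed_gib <= 0:
        raise ValueError("compressed size must be positive")
    return raw_gib / compressed_gib


# VPT dataset: 1.06 TiB -> 7.12 GiB
vpt_ratio = compression_ratio(1.06 * GIB_PER_TIB, 7.12)

# CS:GO dataset: 689 GiB -> 20 GiB
csgo_ratio = compression_ratio(689, 20)

print(f"VPT:   {vpt_ratio:.1f}x")    # about 152x, as quoted
print(f"CS:GO: {csgo_ratio:.2f}x")   # 34.45x, as quoted
```

Under these assumptions both quoted ratios follow directly from the listed sizes, so the 152× headline number refers specifically to the VPT dataset conversion.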