VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video
Brandon Man, Ghadi Nehme, Md Ferdous Alam, Faez Ahmed
TL;DR
This work introduces VideoCAD, a large-scale synthetic dataset of over 41K video demonstrations of precision engineering CAD modeling, designed to capture long-horizon, 3D spatial interactions. It supports two downstream contributions: VideoCADFormer, a transformer-based model that learns UI interactions directly from video and target CAD images and outperforms baselines in long-horizon action prediction; and VideoCADQA, a synthetic VQA benchmark for multimodal 3D and spatiotemporal reasoning. The dataset's complexity shows in horizons of up to 186 low-level UI actions per task and in its need for pixel-level, 3D reasoning. Results show that the model can generate complete CAD sequences from an isometric target and autocomplete partial designs, while exposing gaps in current LLMs' grounded CAD understanding, which motivates future multimodal pretraining and cross-domain automation.
Abstract
Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to 20x longer than those in other datasets. We show two important downstream applications of VideoCAD: (1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multimodal and spatial reasoning, and long-horizon dependencies.
