VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video

Brandon Man, Ghadi Nehme, Md Ferdous Alam, Faez Ahmed

TL;DR

This work introduces VideoCAD, a large-scale synthetic dataset with over 41,005 video demonstrations of precision engineering CAD modeling, designed to capture long-horizon, 3D spatial interactions. Two downstream contributions are proposed: VideoCADFormer, a transformer-based model that learns UI interactions directly from video and target CAD images, outperforming baselines in long-horizon action prediction; and VideoCADQA, a synthetic VQA benchmark for multimodal 3D and spatiotemporal reasoning. The dataset's complexity is showcased by horizons up to 186 low-level UI actions per task and its inclusion of pixel-level, 3D reasoning. Results reveal the model's ability to generate complete CAD sequences from an isometric target and to autocomplete partial designs, while highlighting gaps in current LLMs for grounded CAD understanding, motivating future multimodal pretraining and cross-domain automation.

Abstract

Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to 20x longer than those in other datasets. We show two important downstream applications of VideoCAD: (1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.

Paper Structure

This paper contains 75 sections, 36 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Illustration of the VideoCAD dataset pipeline: human-authored CAD sequences are converted into UI instructions and executed via a rule-based automated method to record videos. Quality filtering, keyframe extraction, and action alignment produce structured video-action pairs.
  • Figure 2: Example of intermediate modeling stages in VideoCAD. A sequence of snapshots illustrating the progressive construction of a CAD model through successive sketching and extrusion operations.
  • Figure 3: Statistical distributions of CAD UI actions and UI sequence lengths. a. Action command frequencies. b. UI sequence length frequencies.
  • Figure 4: Overview of VideoCADFormer for CAD UI action prediction. The model encodes the target image and past UI frames via ViT, fuses them with projected past actions using a cross-attention decoder, and predicts the next action to iteratively build the CAD model in Onshape.
  • Figure 5: Predicted CAD models from VideoCADFormer, conditioned on (a) a target image for generation from scratch, and (b) a partial UI state for autocompletion.
  • ...and 9 more figures
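
The Figure 4 caption describes next-action prediction applied iteratively: the model sees the target image, the past UI frames, and the past actions, and emits one action at a time. A minimal sketch of that rollout loop follows. It is an illustration only, not the authors' implementation: `rollout`, `policy`, `env_step`, the `STOP` action, and the frame and action types are all hypothetical names, and the default cap of 186 steps simply reuses the maximum horizon quoted in the TL;DR.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins: the source does not specify frame or action formats.
Frame = Sequence[float]   # e.g. an encoded UI screenshot
Action = str              # e.g. one low-level UI action

Policy = Callable[[Frame, List[Frame], List[Action]], Action]
EnvStep = Callable[[Action], Frame]


def rollout(
    policy: Policy,
    env_step: EnvStep,
    target: Frame,
    first_frame: Frame,
    max_steps: int = 186,        # assumption: cap taken from the TL;DR horizon
    stop_action: Action = "STOP",  # assumption: an explicit end-of-sequence action
) -> List[Action]:
    """Autoregressively predict UI actions until the policy stops or the cap is hit."""
    frames: List[Frame] = [first_frame]
    actions: List[Action] = []
    for _ in range(max_steps):
        # The policy conditions on the target image plus the full history so far.
        action = policy(target, frames, actions)
        if action == stop_action:
            break
        actions.append(action)
        # Executing the action in the UI yields the next observed frame.
        frames.append(env_step(action))
    return actions


# Toy usage with a scripted policy standing in for a trained model.
def scripted_policy(target: Frame, frames: List[Frame], actions: List[Action]) -> Action:
    script = ["click", "type", "click"]
    return script[len(actions)] if len(actions) < len(script) else "STOP"


def fake_env_step(action: Action) -> Frame:
    # Placeholder for a real UI: returns a dummy one-element frame.
    return [float(len(action))]


predicted = rollout(scripted_policy, fake_env_step, target=[0.0], first_frame=[0.0])
```

In a real setting, `policy` would wrap the trained model and `env_step` would execute the action in the CAD tool and capture the resulting screenshot. Both are left abstract here because the section does not describe those interfaces.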