CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning
Ke Niu, Zhuofan Chen, Haiyang Yu, Yuwen Chen, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue
TL;DR
This work tackles the challenge of accurate orthographic projection reasoning in CAD, where traditional 3D reconstruction and standard fine-tuning struggle with precision and parametric editability. It proposes CReFT-CAD, a two-stage framework that first uses curriculum-driven reinforcement learning with difficulty-aware rewards to cultivate robust reasoning, then applies supervised post-tuning to sharpen instruction-following and semantic extraction. To support this, it introduces TriView2CAD, a large, open, multi-modal benchmark with 200,000 synthetic and 3,000 real-world projections spanning six data modalities and a 15-dimensional parameter space, enabling evaluation of dimension recognition, counting, and composite parameter computation. Experiments show significant improvements over pretrained VLMs on in-domain data and notable generalization on real-world CAD data, with ablations demonstrating the importance of CoT reasoning and multi-task training. Together, these contributions provide a scalable training paradigm and a comprehensive benchmark to advance CAD orthographic-projection reasoning and its integration into design-to-manufacture pipelines.
Abstract
Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches instead employ standard 3D reconstruction pipelines, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers have adopted vision-language models (VLMs), particularly with supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.
