CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning

Ke Niu, Zhuofan Chen, Haiyang Yu, Yuwen Chen, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue

TL;DR

This work tackles the challenge of accurate orthographic projection reasoning in CAD, where traditional 3D reconstruction and standard fine-tuning struggle with precision and parametric editability. It proposes CReFT-CAD, a two-stage framework that first uses curriculum-driven reinforcement learning with difficulty-aware rewards to cultivate robust reasoning, then applies supervised post-tuning to sharpen instruction-following and semantic extraction. To support this, it introduces TriView2CAD, a large, open, multi-modal benchmark with $200{,}000$ synthetic and $3{,}000$ real-world projections spanning six data modalities and a $15$-dimensional parameter space, enabling evaluation of dimension recognition, counting, and composite parameter computation. Experiments show significant improvements over pretrained VLMs on in-domain data and notable generalization on real-world CAD data, with ablations demonstrating the importance of CoT reasoning and multi-task training. Together, these contributions provide a scalable training paradigm and a comprehensive benchmark to advance CAD orthographic-projection reasoning and its integration into design-to-manufacture pipelines.

Abstract

Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches instead rely on standard 3D reconstruction pipelines, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers have adopted vision-language models (VLMs), particularly with supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) and (b) show the results of parameterization tasks in orthographic projection reasoning using Qwen2.5-VL and DeepSeek without tuning. (c)–(f) illustrate CReFT-CAD’s capabilities across various orthographic projection reasoning tasks.
  • Figure 2: Constraint-driven synthesis pipeline for TriView2CAD.
  • Figure 3: Examples of real-world orthographic projections and their 3D models.
  • Figure 4: Design of the three tasks in curriculum-driven reinforcement fine-tuning.
  • Figure 5: Four sets of failure cases. Each set consists of three parts: the left part shows the orthographic projection, the middle part presents the failed 3D model construction, and the right part illustrates the correctly constructed 3D model.