GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback

Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, Faez Ahmed

Abstract

Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because the collection of diverse and verified engineering datasets is both expensive and difficult to scale, constraining the development of robust generative CAD models. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.
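The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the function name `gift_partition`, the thresholds, and the `iou_of` oracle (which would execute a candidate program and score its geometry against the target) are all hypothetical.

```python
def gift_partition(candidates, iou_of, accept_thresh=0.9, near_miss_thresh=0.6):
    """Illustrative split of sampled CAD programs into soft-rejection keeps
    (GIFT-REJECT style) and near-miss augmentation fodder (GIFT-FAIL style),
    scored by a geometric-fidelity oracle. Thresholds are made up."""
    keep, near_miss = [], []
    for prog in candidates:
        iou = iou_of(prog)  # geometric IoU of executed program vs. target
        if iou >= accept_thresh:
            # High-fidelity but not necessarily an exact ground-truth match:
            # retained as a diverse positive training sample.
            keep.append((prog, iou))
        elif iou >= near_miss_thresh:
            # Close but flawed: a candidate for failure-driven augmentation.
            near_miss.append((prog, iou))
    return keep, near_miss
```

The key design point the abstract describes is that both buckets feed back into fine-tuning, so the cost of sampling at test time is amortized into the model's weights.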

Paper Structure

This paper contains 62 sections, 18 equations, 24 figures, 12 tables, and 1 algorithm.

Figures (24)

  • Figure 1: Efficiency vs. Performance. We compare Pass@k (test set IoU) across compute budgets. GIFT (green) matches the peak performance of the CAD-Coder-SFT baseline (orange) while using 80% less compute (requiring far fewer samples). GIFT outperforms both SFT and GIFT-REJECT at every compute level, demonstrating that self-training with geometric feedback significantly enhances image-conditional CAD generation. Results reflect mean IoU on the GenCAD test subset, including failure cases. An extended scaling analysis is provided in the Appendix.
  • Figure 2: Robustness Analysis. (a) GIFT achieves superior accuracy (Mean/Median IoU). (b, c) While all models degrade with increased task complexity (token length), GIFT maintains higher resilience than SFT. (d) GIFT solves 53% more problems compared to the baseline, highlighting the benefit of diverse training data.
  • Figure 3: The CAD-Coder pipeline processes multimodal inputs (text prompts and images) to generate executable CAD code. This code is converted into STEP files via geometric alignment and validated against the ground truth using IoU metrics.
  • Figure 4: Comparison of Vision-Language Model generations: (a) standard text description vs. (b) executable CAD code.
  • Figure 5: Standard SFT Baseline. The model is trained on static image-code pairs using a next-token prediction objective.
  • ...and 19 more figures
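The pipeline in Figure 3 validates generated geometry against ground truth using IoU. One common way to compute such a score is over voxelized occupancy grids; the sketch below assumes that representation (the paper may compute IoU differently, e.g. over meshes or point samples).

```python
import numpy as np

def voxel_iou(pred, gt):
    """IoU between two boolean occupancy grids (e.g., voxelized geometry
    exported from STEP files). Representation choice is an assumption."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Two empty shapes are treated as identical to avoid division by zero.
    return float(inter) / float(union) if union else 1.0
```

A prediction covering half of a target's occupied cells, with no spurious volume, would score 0.5 under this definition.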