CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Sifan Wu, Amir Khasahmadi, Mor Katz, Pradeep Kumar Jayaraman, Yewen Pu, Karl Willis, Bang Liu

TL;DR

CadVLM is an end-to-end vision language model for CAD generation. To the authors' knowledge, it is the first multimodal Large Language Model (LLM) successfully applied to parametric CAD generation, a pioneering step in computer-aided mechanical design.

Abstract

Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.

Paper Structure (15 sections, 7 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: For the CAD autocompletion task, our multi-modal CadVLM model (b) receives partial CAD entities as both image and text input (a) and generates the remaining sketch entities as output (c). The complete sketch, with optional predicted constraints, can then be used in CAD software (d) to form 3D shapes (e). The primitive values in the sketch text are described further in the Appendix.
  • Figure 2: CadVLM model architecture and training objectives for CAD autoconstraint. We propose a unified vision-language model capable of outputting both autocompleted sketch images and sketch primitives. The inputs are pairs of incomplete sketch primitives and images, which are encoded by a pre-trained text encoder and image encoder, respectively. The two embeddings are aligned by an image-text contrastive (ITC) loss, and their concatenation is then fed to the text-grounded image decoder and the image-grounded text decoder. The image decoder is trained with an image decoding loss between the autocompleted and ground-truth sketch images. The text decoder is trained with a language modeling (LM) loss to generate the autocompleted primitives.
  • Figure 3: CAD image reconstruction results by ViT-MAE. (a) Input sketches, (b) masked sketches with a mask ratio of 75%, and reconstructions from (c) a pre-trained and (d) a further fine-tuned ViT-MAE.
  • Figure 4: Comparative Analysis of Autocompletion in CAD Design. Top row: Random samples from the SketchGraphs test dataset. Second row: Initial input entities serving as the primer for the autocompletion task. Third and fourth rows: Autocompletion results produced by Vitruvion and CadVLM.
  • Figure 5: Effect of the input entity ratio on (a) Entity Accuracy, (b) Sketch Accuracy, and (c) CAD F1.
  • ...and 4 more figures
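
The Figure 2 caption names an image-text contrastive (ITC) loss that aligns the image and text embeddings, but this page does not give its formula. As a rough illustration only, the sketch below assumes the symmetric, CLIP-style InfoNCE form of such a loss: within a batch, each sketch image should be most similar to its own primitive sequence, and vice versa. The function names, the temperature of 0.07, and the use of plain Python lists are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings (assumed form).

    image_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = diagonal_cross_entropy(sim)
    text_to_image = diagonal_cross_entropy(sim_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two "image" and two "text" embeddings in 2-D.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image embedding points the same way as its paired text embedding, and large when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to reward.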