CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, Shenghua Gao

TL;DR

CAD-MLLM introduces a unified multimodal framework for parametric CAD generation conditioned on text, images, and point clouds using a large language model with specialized cross-modal alignment and LoRA fine-tuning. It leverages a new Omni-CAD dataset, transforming CAD command sequences into a learnable representation and pairing them with multimodal data to enable conditional generation. The approach introduces four topology- and enclosure-aware metrics and demonstrates state-of-the-art performance across point, image, text, and multimodal conditioning, with strong robustness to noisy and partial inputs and good generalization to unseen data. This work significantly lowers the barrier for non-experts to create precise CAD models by integrating natural modalities into a single generation framework.

Abstract

This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user's inputs in the form of textual descriptions, images, point clouds, or even a combination of them. Towards this goal, we introduce CAD-MLLM, the first system capable of generating parametric CAD models conditioned on multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multimodal data and the CAD models' vectorized representations. To facilitate model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains a textual description, multi-view images, points, and a command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noise and missing points. The project page and more visualizations can be found at: https://cad-mllm.github.io/

Paper Structure

This paper contains 32 sections, 5 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: A simple example of the construction process of a CAD model with the command sequence representation. Starting with a sketch operation on a chosen 2D plane, the extrusion operation then "drags" this 2D sketch into a 3D solid volume. Further editing requires another extruded 3D solid volume. Subsequently, the union operation "merges" these two 3D solids into a single integrated solid. Other boolean operators from Constructive Solid Geometry (CSG) support the construction of more complex geometries. As a result, this CAD model can be represented by its command sequence.
  • Figure 2: Qualitative comparison between our CAD command sequence dataset and the DeepCAD dataset. As described in the section on dataset creation, the DeepCAD dataset is part of our dataset. In the visualization of our dataset, we exclude the CAD models whose IDs are already included in the DeepCAD dataset. The extended part of our dataset contains more complex and realistic models with more details. Best viewed zoomed in.
  • Figure 3: Statistical comparison between our dataset and the DeepCAD dataset. The statistics are computed before data augmentation. The charts indicate that our dataset covers a wider range of sequence counts and extrusion operation counts, including more challenging cases.
  • Figure 4: Our network architecture. The network can process any of three input modalities, or any combination of them, each uniquely color-coded. We show the most complex case, where all three inputs are provided simultaneously. Except for the textual description, each modality is first processed by its corresponding frozen encoder. The encoded features are then passed through a trainable projection layer that aligns them within a unified language feature space. The Large Language Model (LLM), fine-tuned with Low-Rank Adaptation (LoRA), then processes the prompt together with the projected embeddings, enabling accurate generation of CAD models.
  • Figure 5: We present qualitative point-based reconstruction results on our dataset and compare our generative method with the point-based B-rep reconstruction baseline. Blue lines highlight dangling edges in the reconstructed models. Our method produces high-fidelity reconstructions, most of which are strict manifolds with no dangling edges (no blue lines). The baseline results, by contrast, contain many dangling edges, which illustrates that our method produces better topology.
  • ...and 9 more figures
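
The Figure 1 caption describes a CAD model as a sequence of sketch, extrude, and boolean commands. The following is a minimal Python sketch of that idea, not the paper's actual representation: the class names `Sketch` and `Extrude`, the rectangle-only profile, and the `summarize` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Sketch:
    """A closed rectangular 2D profile on a named plane (hypothetical, simplified)."""
    plane: str
    origin: Tuple[float, float]
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


@dataclass
class Extrude:
    """Drag the most recent sketch along the plane normal by `distance`."""
    distance: float
    boolean: str  # "new" starts a solid, "union" merges into the existing one


Command = Union[Sketch, Extrude]


def summarize(sequence: List[Command]) -> Tuple[int, float]:
    """Return (number of extruded solids, summed volume ignoring any overlap)."""
    solids = 0
    volume = 0.0
    current = None
    for cmd in sequence:
        if isinstance(cmd, Sketch):
            current = cmd  # the next extrude consumes this profile
        elif isinstance(cmd, Extrude):
            if current is None:
                raise ValueError("extrude issued before any sketch")
            solids += 1
            volume += current.area() * cmd.distance
    return solids, volume


# Two sketch/extrude pairs; the second solid is merged into the first by union.
sequence = [
    Sketch("XY", (0.0, 0.0), 4.0, 2.0),
    Extrude(1.0, "new"),
    Sketch("XY", (1.0, 0.5), 1.0, 1.0),
    Extrude(3.0, "union"),
]
print(summarize(sequence))
```

Because the whole model is an ordered list of small records, it can be serialized as text tokens, which is what makes a command sequence a natural output format for a language model.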
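
The Figure 5 caption highlights dangling edges as a sign of poor topology. As a rough illustration only, the Python sketch below counts dangling edges in a triangle mesh, assuming a dangling edge is one that belongs to exactly one face. That definition and the `dangling_edges` helper are assumptions made here; the paper's metrics are defined on its own CAD outputs and their exact formulation is not given in this excerpt.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Face = Tuple[int, int, int]  # vertex indices of one triangle
Edge = Tuple[int, int]       # undirected edge, stored as (low, high)


def dangling_edges(faces: Iterable[Face]) -> List[Edge]:
    """Return edges used by exactly one face (assumed definition of 'dangling')."""
    counts: Counter = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return sorted(edge for edge, n in counts.items() if n == 1)


# A closed tetrahedron: every edge is shared by two faces.
tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# Removing one face opens the surface and leaves three boundary edges.
open_tetra = tetra[:-1]

print(dangling_edges(tetra))
print(dangling_edges(open_tetra))
```

A watertight surface yields an empty list, so a count of zero matches the "strict manifold, no blue lines" case described in the caption under this assumed definition.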