BrepLLM: Native Boundary Representation Understanding with Large Language Models

Liyuan Deng, Hao Guo, Yunpeng Bai, Yongkang Dai, Huaxi Huang, Yilei Shi

TL;DR

BrepLLM tackles the challenge of direct Brep understanding by bridging structured 3D geometry with natural language. It introduces adaptive UV sampling and a hierarchical BrepEncoder to produce global and token-level geometric representations, trained with CLIP-style cross-modal alignment. A three-stage LLM fine-tuning pipeline, including a residual Mixture-of-Query Experts, enables deep geometric reasoning within language models. The Brep2Text dataset provides 269,444 Brep-text question-answer pairs as a benchmark for Brep-centric tasks, and experiments show state-of-the-art performance on 3D captioning and generative classification, highlighting the practical impact of direct Brep–language reasoning for CAD applications.

Abstract

Current token-sequence-based Large Language Models (LLMs) are not well suited to directly processing 3D Boundary Representation (Brep) models, which contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations that carry geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens, and we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM and align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors; (2) fine-tuning the LLM; and (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.
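
The contrastive alignment in the first stage can be illustrated with a minimal sketch. The abstract only says that the global token is aligned with CLIP text embeddings "via contrastive learning"; the symmetric InfoNCE form below, the function names, and the temperature of 0.07 are assumptions borrowed from the usual CLIP recipe, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(brep_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (Brep, text) embeddings.

    Assumption: row i of brep_embs is the positive match for row i of
    text_embs, and every other row in the batch serves as a negative.
    """
    b = [l2_normalize(v) for v in brep_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(b)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(x * y for x, y in zip(bi, tj)) / temperature for tj in t]
        for bi in b
    ]
    # Brep-to-text direction: each row should pick its own column.
    brep_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-Brep direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_brep = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (brep_to_text + text_to_brep)


if __name__ == "__main__":
    brep = [[1.0, 0.0], [0.0, 1.0]]      # toy global tokens
    matched = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings, correctly paired
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # same embeddings, wrongly paired
    print(clip_style_loss(brep, matched) < clip_style_loss(brep, swapped))
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, which is the behavior the alignment stage relies on. In the setting the abstract describes, the text encoder is frozen, so only the Brep side would receive gradient updates.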

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: BrepLLM, trained to directly understand Brep data, enables natural interaction through both text and Brep data. The model interprets Brep data as input and provides accurate, text-based responses to user queries about CAD parts.
  • Figure 2: Overview of the BrepLLM architecture. The framework consists of two steps. Step 1 (Left): Cross-modal Alignment Pre-training. BrepEncoder processes the Brep model to produce a global feature. This feature is then aligned with text embeddings from a frozen CLIP Text Encoder (ViT-L/14) using a contrastive loss. Step 2 (Right): Multi-stage LLM Fine-tuning. The frozen BrepEncoder's node tokens are progressively aligned with the LLM. Stage I trains an MLP to map the node tokens; Stage II fine-tunes the Q-Former (LoRA) and LLM (LoRA); Stage III introduces a Mixture-of-Query Experts (MQE).
  • Figure 3: Overview of BrepEncoder. (a) B-rep parameterization using area-adaptive UV sampling for faces and length-adaptive sampling for edges, producing the face attribute tensor $\mathbf{X}_{\mathcal{S}}$ and edge attribute tensor $\mathbf{X}_{\mathcal{C}}$. (b) Hierarchical BrepEncoder. Face features $F_f$, edge-conditioned features $F_e$, and global topology features $F_t$ are extracted from per-node tokens $\mathbf{h}_i$. A global graph feature $\mathbf{h}_{\text{cls}}$ is obtained via global attention pooling.
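
Two ideas from the Figure 3 caption can be sketched in a few lines: area-adaptive sampling for faces and global attention pooling over node tokens. The caption gives no formulas, so everything below is an assumption made for illustration: the hypothetical `face_grid_resolution` scales the per-axis UV resolution with the square root of the face area (so the total sample count grows roughly in proportion to area), and the hypothetical `attention_pool` scores each node token with a single gate vector before taking a softmax-weighted sum. The actual BrepEncoder may differ in both respects.

```python
import math


def face_grid_resolution(area, reference_area, base=8, lo=4, hi=32):
    """Pick a per-axis UV grid resolution for one face (hypothetical rule).

    Resolution scales with sqrt(area / reference_area), so the number of
    samples (resolution squared) grows roughly linearly with face area.
    The result is clamped to [lo, hi]; base, lo, and hi are made-up defaults.
    """
    scale = math.sqrt(area / reference_area)
    return max(lo, min(hi, round(base * scale)))


def attention_pool(node_tokens, gate_weights):
    """Pool per-node tokens h_i into one global vector.

    Each token gets a scalar score (dot product with gate_weights), the
    scores are softmax-normalized, and the tokens are summed with those
    weights. Returns the pooled vector and the attention weights.
    """
    scores = [sum(w * x for w, x in zip(gate_weights, h)) for h in node_tokens]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(node_tokens[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, node_tokens)) for d in range(dim)]
    return pooled, alphas


if __name__ == "__main__":
    # A face four times larger than the reference gets twice the per-axis resolution.
    print(face_grid_resolution(4.0, 1.0))
    # With an all-zero gate, every node gets equal weight and pooling is a plain mean.
    print(attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

A learned gate would let the pooled feature emphasize some faces over others, whereas the all-zero gate in the usage example reduces to mean pooling.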