PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

Yuanbo Li; Dule Shu; Yanying Chen; Matt Klenk; Daniel Ritchie

TL;DR

The paper tackles the lack of paired CAD-program data by proposing PLLM, a self-training framework that leverages unlabeled 3D shapes to synthesize supervision for CAD program synthesis. It uses a pre-trained CAD-capable LLM to generate candidate programs, executes them, and selectively retains high-fidelity program–shape pairs, then expands and diversifies programs via program-level edits before fine-tuning. Applied to adapt CAD-Recode from DeepCAD to the ABC dataset, PLLM achieves consistent improvements in geometric fidelity and program diversity across iterations. This data-centric approach reduces reliance on manual annotations and enables scalable adaptation to new CAD languages and domains.

Abstract

Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.
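One round of the sample, select, augment loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `sample`, `execute`, `fidelity`, and `augment`, the function name `pseudo_label_round`, and the parameters `k` and `min_fidelity` are all hypothetical stand-ins for the model sampler, the CAD kernel, a shape-similarity score, and the program-level edits. The returned pairs follow the Figure 1 convention, where a program is the label $Z$ and its own execution is the input $X$.

```python
from typing import Any, Callable, List, Sequence, Tuple


def pseudo_label_round(
    shapes: Sequence[Any],
    sample: Callable[[Any, int], List[Any]],   # hypothetical: draw k programs for a shape
    execute: Callable[[Any], Any],             # hypothetical: run a program, return a shape
    fidelity: Callable[[Any, Any], float],     # hypothetical: higher means closer shapes
    augment: Callable[[Any], List[Any]],       # hypothetical: expanded/shortened variants
    k: int = 4,
    min_fidelity: float = 0.5,
) -> List[Tuple[Any, Any]]:
    """Build (X, Z) training pairs from unlabeled shapes for one iteration."""
    pairs: List[Tuple[Any, Any]] = []
    for shape in shapes:
        # Score every candidate program that executes without error.
        scored = []
        for program in sample(shape, k):
            try:
                output = execute(program)
            except Exception:
                continue  # invalid programs are simply discarded
            scored.append((fidelity(shape, output), program))
        if not scored:
            continue

        # Keep only the best candidate, and only if it is faithful enough.
        best_score, best_program = max(scored, key=lambda item: item[0])
        if best_score < min_fidelity:
            continue

        # Pair each retained or edited program with its own execution.
        for z in [best_program, *augment(best_program)]:
            try:
                x = execute(z)
            except Exception:
                continue
            pairs.append((x, z))
    return pairs
```

The pairs returned by such a round would then be used to fine-tune the model before the next round; the fixed threshold used here is only one possible selection rule.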

Paper Structure (23 sections, 5 figures, 1 table)

Introduction
Related Works
Self Supervised Training
Learning to Recover CAD Programs
Method
Program Sampling
Program-Level Data Augmentation
Program Expansion
Program Shortening
Training Data Pairs
Implementation
CAD-Recode
LoRA Fine-Tuning
Computational Cost
Results and Evaluations
...and 8 more sections

Figures (5)

Figure 1: We show the overall pipeline in (a). At each iteration, the model first takes an input shape and samples multiple candidate programs. The selection algorithm then identifies the best program–shape pairs, which are used for training in the next iteration. (b) illustrates the details of the program length diversification process, where we perform both program expansion and shortening to create additional variants. The edited programs serve as labels $Z$, and their corresponding executions are treated as inputs $X$ to form the new training dataset.
Figure 2: We compare quantitative results across iterations: (a) Chamfer Distance, (b) IoU, and (c) Program Length.
Figure 3: Overview of different baseline strategies compared in our study. The figure illustrates how each baseline constructs its $(X, Z)$ training pairs. Baseline 1 uses the generated program and its execution; Baseline 2 uses the input shape and its best generated program; and Baseline 3 samples within each batch, selecting only the top 20% of high-performing pairs. Our proposed method further introduces program expansion and shortening to generate paired data $(X, Z)$ that better align with the target distribution.
Figure 4: Comparison between our results and those produced by CAD-Recode, whose outputs correspond to the first iteration of our framework.
Figure 5: Results across different iterations, showing that the generated shapes gradually improve in quality as training progresses.
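The two fidelity metrics named in Figure 2, Chamfer Distance and IoU, can be illustrated with a small pure-Python sketch. The conventions below are assumptions, since this summary does not state which variants the paper uses: Chamfer Distance is taken as the sum of the two mean squared nearest-neighbor distances between point sets, and IoU is computed over sets of occupied voxel indices. The function names `chamfer_distance` and `voxel_iou` are hypothetical.

```python
from typing import Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def _squared_distance(p: Point, q: Point) -> float:
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance (assumed: mean squared nearest-neighbor form).

    Brute force, O(len(a) * len(b)); fine for illustration only.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(_squared_distance(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_squared_distance(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection over union of two sets of occupied voxel indices."""
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)
```

Lower Chamfer Distance and higher IoU both indicate that an executed program reproduces the target shape more closely.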
