Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

Yunqing Liu, Nan Zhang, Zhiming Tan

TL;DR

This work proposes a 2-stage framework with inference-time adaptation that combines corrected Error Notebooks with RAG to substantially improve VLM-based part retrieval reasoning. It surpasses other training-free baselines and also yields substantial improvements for open-source VLMs.

Abstract

Effective specification-aware part retrieval within complex CAD assemblies is essential for automated engineering tasks. However, using LLMs/VLMs for this task is challenging: the CAD model metadata sequences often exceed token budgets, and fine-tuning is unavailable for high-performing proprietary models (e.g., GPT or Gemini). We therefore need a training-free framework that delivers engineering value by handling long, non-natural-language CAD model metadata with VLMs. We propose a 2-stage framework with inference-time adaptation that combines corrected Error Notebooks with RAG to substantially improve VLM-based part retrieval reasoning. Each Error Notebook is built by correcting initial CoTs through reflective refinement, and then filtering each trajectory using our proposed grammar-constraint (GC) verifier to ensure structural well-formedness. The resulting notebook forms a high-quality repository of specification-CoT-answer triplets, from which RAG retrieves specification-relevant exemplars to condition the model's inference. We additionally contribute a CAD dataset with human preference annotations. Experiments with proprietary models (GPT-4o, Gemini, etc.) show large gains, with GPT-4o (Omni) achieving up to +23.4 absolute accuracy points on the human-preference benchmark. The proposed GC verifier adds up to a further +4.5 accuracy points. Our approach also surpasses other training-free baselines (standard few-shot learning, self-consistency) and also yields substantial improvements for open-source VLMs (Qwen2-VL-2B-Instruct, Aya-Vision-8B). Under the cross-model GC setting, where the Error Notebook is constructed using GPT-4o (Omni), the 2B model comes within roughly 4 points of GPT-4o mini at inference.
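The retrieval step described in the abstract (pulling specification-relevant exemplars from a notebook of specification-CoT-answer triplets and using them to condition inference) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `NotebookEntry` type, the `retrieve_exemplars` and `build_prompt` helpers, the toy notebook contents, and in particular the bag-of-words cosine similarity, which stands in for whatever retriever the paper actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class NotebookEntry:
    """One specification-CoT-answer triplet (hypothetical layout)."""
    specification: str
    cot: str
    answer: str


def _bag_of_words(text):
    # Lowercased token counts; a stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_exemplars(notebook, query_spec, k=2):
    """Return the k entries whose specification is most similar to the query."""
    query = _bag_of_words(query_spec)
    ranked = sorted(
        notebook,
        key=lambda e: _cosine(query, _bag_of_words(e.specification)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(exemplars, query_spec):
    """Place retrieved triplets before the new specification as few-shot examples."""
    blocks = [
        f"Specification: {e.specification}\nReasoning: {e.cot}\nAnswer: {e.answer}"
        for e in exemplars
    ]
    blocks.append(f"Specification: {query_spec}\nReasoning:")
    return "\n\n".join(blocks)


# Toy notebook with invented entries, for illustration only.
NOTEBOOK = [
    NotebookEntry(
        "the bolt that fastens the cover plate to the housing",
        "Step 1: find threaded parts. Step 2: keep the one touching the cover.",
        "part_3",
    ),
    NotebookEntry(
        "the shaft that drives the gear",
        "Step 1: find cylindrical parts. Step 2: keep the one meshing with the gear.",
        "part_7",
    ),
    NotebookEntry(
        "the bearing that supports the rotating shaft",
        "Step 1: find ring-shaped parts. Step 2: keep the one around the shaft.",
        "part_5",
    ),
]

QUERY = "bolt fastening the cover plate to the housing"
TOP = retrieve_exemplars(NOTEBOOK, QUERY, k=2)
PROMPT = build_prompt(TOP, QUERY)
```

In this sketch the prompt ends with an open `Reasoning:` slot, so the model is expected to continue with its own step-by-step reasoning after seeing the retrieved exemplars. A real system would presumably replace the token-count similarity with a learned embedding.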

Paper Structure

This paper contains 18 sections, 10 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Scope of work. The goal is to retrieve symbolic part identifiers from a long, non-natural-language assembly CAD model using a natural-language specification. Our two-stage VLM pipeline first converts CAD model part information into geometric descriptions (1st VLM), then performs specification-aware reasoning (2nd VLM) assisted by Error Notebook with RAG.
  • Figure 2: Overview of (a) the dataset construction pipeline and (b) the Error Notebook + RAG-based inference process. (a) For each assembly, a VLM is used to generate concise and discriminative natural language descriptions for every part. Subsequently, the model generates assembly-level specification sentences describing the required relationship. To support human annotation, the specified parts are merged and visualized as a CAD model image. (b) At the 2nd stage, following the 1st VLM, the system retrieves from the Error Notebook the examples most relevant to the given assembly specification, incorporates these as few-shot exemplars, and then performs step-by-step reasoning to generate the final answer.
  • Figure 3: Error Notebook construction. We define a corrected reasoning trajectory as the concatenation of: 1) all steps up to the first error, 2) a natural language reflection that pinpoints and transitions from the error, and 3) the corrected reasoning steps that ultimately yield the ground-truth answer. The proposed GC check is further employed to improve the quality of the Error Notebook.
  • Figure A.1: Effect of CoT reasoning and exemplar number on retrieval accuracy across different assembly complexities and datasets. Top row: results on the self-generated dataset; bottom row: results on the human preference dataset. (a) For simple assemblies ($<10$ parts). (b) For more complex assemblies (10–50 parts). The $x$-axis indicates the number of exemplars retrieved from the Error Notebook, where each exemplar consists of either (i) the final corrected answer only (Non-CoT group) or (ii) the corrected CoT reasoning steps plus the final answer (CoT group).
  • Figure A.2: Accuracy comparison between the proposed pipeline and image-only reasoning. Performance is shown for the proposed pipeline, which leverages part descriptions as intermediate references, versus the image-only variant, which reasons directly over images.
  • ...and 5 more figures
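
The corrected reasoning trajectory defined in the Figure 3 caption is a three-part concatenation, which a short sketch can make concrete. The names `CorrectedTrajectory` and `build_corrected_trajectory` are hypothetical, and so is the reading that "all steps up to the first error" includes the erroneous step itself (so that the reflection has something to point at). The GC verifier is not modelled here, because the caption does not specify its grammar.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectedTrajectory:
    """Corrected CoT steps plus the ground-truth answer (hypothetical layout)."""
    steps: List[str]
    answer: str


def build_corrected_trajectory(
    initial_steps: List[str],
    first_error_index: int,
    reflection: str,
    corrected_steps: List[str],
    answer: str,
) -> CorrectedTrajectory:
    """Concatenate prefix, reflection, and corrected steps.

    Assumption: the prefix keeps the erroneous step (index first_error_index)
    so the reflection can refer to it; later original steps are discarded.
    """
    if not 0 <= first_error_index < len(initial_steps):
        raise ValueError("first_error_index must point at a step in initial_steps")
    prefix = initial_steps[: first_error_index + 1]
    return CorrectedTrajectory(
        steps=prefix + [reflection] + list(corrected_steps),
        answer=answer,
    )


# Invented example steps, for illustration only.
EXAMPLE = build_corrected_trajectory(
    initial_steps=[
        "Step 1: the specification asks for a fastener.",
        "Step 2: part_7 is a fastener.",
        "Step 3: so the answer is part_7.",
    ],
    first_error_index=1,
    reflection="Wait, part_7 is described as a shaft, not a fastener.",
    corrected_steps=["Step 2: part_3 is described as a bolt, which is a fastener."],
    answer="part_3",
)
```

Under these assumptions, the original steps after the first error are dropped, so the stored trajectory never contains reasoning that builds on the mistake without first being corrected.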