DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

Anna C. Doris, Daniele Grandi, Ryan Tomich, Md Ferdous Alam, Mohammadmehdi Ataei, Hyunmin Cheong, Faez Ahmed

TL;DR

DesignQA introduces a 1451-question benchmark linking Formula SAE design rules with MIT Motorsports CAD data to evaluate multimodal LLMs on document-grounded design tasks. The benchmark structurally divides tasks into Rule Extraction, Rule Comprehension, and Rule Compliance, enabling fine-grained analysis of retrieval, understanding, and verification against engineering requirements. Across baselines and state-of-the-art models, GPT-4o-AllRules generally achieves the best performance, but significant gaps remain in reliably extracting rules, recognizing CAD components, and analyzing engineering drawings, highlighting the need for improved cross-modal reasoning and longer-context reasoning in engineering contexts. The work provides a framework and data for ongoing evaluation and suggests directions like enhanced RAG, guaranteed-context retrieval, and improved visual grounding to advance AI-assisted engineering design.

Abstract

This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing) like GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5 against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of design according to technical documentation. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.

Paper Structure (32 sections, 3 figures, 5 tables)

This paper contains 32 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the three different segments (Rule Extraction, Rule Comprehension, and Rule Compliance) and six subsets (Retrieval, Compilation, Definition, Presence, Dimension, and Functional Performance) in DesignQA. Prompts and images shown above are condensed versions of the actual prompts and images used. The bottom right table shows the metrics and the number of questions for each subset of the benchmark.
  • Figure 2: Representing 3D CAD models in 2D images. A) Multi-view CAD image. B) Close-up CAD image. C-D) Engineering drawing images. C) uses the direct dimensioning method and D) uses the scale bar dimensioning method.
  • Figure 3: Sample responses from GPT-4-AllRules, GPT-4-RAG, and LLaVA-1.5-RAG across different subsets of the benchmark. We show the subsets that have evaluation metrics that can be harder to interpret, to provide references for various scores. The bolded portions of the predicted responses show what we interpreted to be correct.
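
The three segments and six subsets named in the Figure 1 caption can be written down as a small lookup structure. The following is only an illustrative sketch, not code from the DesignQA repository: the names `Segment`, `SUBSETS_BY_SEGMENT`, and `segment_of` are hypothetical, and the pairing of subsets to segments is an assumption inferred from the order in which the caption lists them. Per-subset question counts and metrics are omitted because they are not stated in this summary.

```python
from enum import Enum


class Segment(Enum):
    """The three benchmark segments named in the Figure 1 caption."""
    RULE_EXTRACTION = "Rule Extraction"
    RULE_COMPREHENSION = "Rule Comprehension"
    RULE_COMPLIANCE = "Rule Compliance"


# Assumption: subsets are paired with segments in the order the caption
# lists them (two subsets per segment). This is not confirmed by the text.
SUBSETS_BY_SEGMENT = {
    Segment.RULE_EXTRACTION: ("Retrieval", "Compilation"),
    Segment.RULE_COMPREHENSION: ("Definition", "Presence"),
    Segment.RULE_COMPLIANCE: ("Dimension", "Functional Performance"),
}


def segment_of(subset: str) -> Segment:
    """Return the segment that a subset belongs to under the assumed mapping."""
    for segment, subsets in SUBSETS_BY_SEGMENT.items():
        if subset in subsets:
            return segment
    raise KeyError(f"unknown subset: {subset!r}")


if __name__ == "__main__":
    for segment, subsets in SUBSETS_BY_SEGMENT.items():
        print(f"{segment.value}: {', '.join(subsets)}")
```

A structure like this makes it straightforward to group per-subset scores by segment when comparing models, should the assumed mapping hold.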