DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
Anna C. Doris, Daniele Grandi, Ryan Tomich, Md Ferdous Alam, Mohammadmehdi Ataei, Hyunmin Cheong, Faez Ahmed
TL;DR
DesignQA is a 1451-question benchmark linking Formula SAE design rules with MIT Motorsports CAD data to evaluate multimodal LLMs on document-grounded design tasks. The benchmark divides tasks into Rule Extraction, Rule Comprehension, and Rule Compliance, enabling fine-grained analysis of retrieval, understanding, and verification against engineering requirements. Among the baselines and state-of-the-art models tested, GPT-4o-AllRules generally performs best, but significant gaps remain in reliably extracting rules, recognizing CAD components, and analyzing engineering drawings, highlighting the need for improved cross-modal and long-context reasoning in engineering settings. The work provides a framework and data for ongoing evaluation and suggests directions such as enhanced RAG, guaranteed-context retrieval, and improved visual grounding to advance AI-assisted engineering design.
Abstract
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Unlike many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate models that were state-of-the-art at the time of writing, such as GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5, against the benchmark, and our study uncovers existing gaps in MLLMs' abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of design according to technical documentation. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.
