Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Ruoxin Xiong; Yanyu Wang; Suat Gunhan; Yimin Zhu; Charles Berryman

TL;DR

This paper introduces CMExamSet, a curated dataset of 689 CM certification exam MCQs designed to benchmark state-of-the-art LLMs in construction management tasks. Using a zero-shot framework, it compares GPT-4o and Claude 3.7 across overall accuracy, subject areas, reasoning complexity, and question formats, benchmarking against human pass thresholds. The results show both models exceed typical certification pass marks but struggle notably with figure-referenced questions and multi-step reasoning, with error patterns dominated by conceptual misunderstandings, underscoring the need for domain-adaptive reasoning and human oversight. The work provides a domain-specific benchmarking framework with clear educational and industry implications, while acknowledging limitations and outlining future directions for more multimodal and domain-tuned AI systems in CM.

Abstract

The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflows and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy and breaks performance down by subject area (e.g., construction safety), reasoning complexity (single-step and multi-step), and question format (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass the typical human pass threshold (70%), with average accuracies of 82% and 83%, respectively. Both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common error type (44.4% and 47.9%), pointing to the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.
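The breakdown described in the abstract (overall accuracy, then accuracy by reasoning complexity and question format, compared against the 70% pass threshold) can be illustrated with a minimal sketch. The record fields (`format`, `steps`, `answer`, `predicted`), the helper `accuracy_by_group`, and the four toy records are assumptions made for illustration; they are not the authors' code or CMExamSet data.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.70  # typical human pass mark cited in the abstract


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for the given record field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["predicted"] == rec["answer"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Toy records, invented for illustration only (hypothetical, not CMExamSet data).
records = [
    {"format": "text-only", "steps": "single-step", "answer": "A", "predicted": "A"},
    {"format": "text-only", "steps": "multi-step", "answer": "C", "predicted": "C"},
    {"format": "figure-referenced", "steps": "multi-step", "answer": "B", "predicted": "D"},
    {"format": "table-referenced", "steps": "single-step", "answer": "D", "predicted": "D"},
]

# Overall accuracy is the share of questions answered correctly.
overall = sum(r["predicted"] == r["answer"] for r in records) / len(records)
by_format = accuracy_by_group(records, "format")
by_steps = accuracy_by_group(records, "steps")
passes = overall >= PASS_THRESHOLD
```

Grouping by a single field keeps each breakdown independent, so the same helper serves subject area, reasoning complexity, and question format alike.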
