Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Usman Syed; Ethan Light; Xingang Guo; Huan Zhang; Lianhui Qin; Yanfeng Ouyang; Bin Hu

TL;DR

TransportBench provides a formal benchmark to assess LLMs on undergraduate transportation engineering problems, spanning planning, design, and operations. The study evaluates accuracy, consistency, and reasoning behaviors of several leading LLMs in zero-shot and self-checking contexts, revealing Claude 3.5 Sonnet as the top performer overall, with GPT-4o close behind and Llama 3.1 approaching GPT-4 levels. Results also show that True/False problems are typically easier than general Q&A, and that performance can degrade under self-checking prompts, underscoring the complexity of reliable LLM reasoning in domain tasks. The work highlights both the promise of LLMs for transportation engineering and the need for targeted prompting, domain adaptation, and tooling to translate AI capabilities into robust, real-world practice.

Abstract

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset that includes a sample of transportation engineering problems on a wide range of subjects in the context of planning, design, management, and control of transportation systems. Human experts use this dataset to evaluate the capabilities of various commercial and open-source LLMs, especially their accuracy, consistency, and reasoning behaviors, in solving transportation engineering problems. Our comprehensive analysis uncovers the unique strengths and limitations of each LLM. For example, it shows the impressive accuracy and some unexpectedly inconsistent behaviors of Claude 3.5 Sonnet in solving TransportBench problems. Our study marks a thrilling first step toward harnessing artificial general intelligence for complex transportation challenges.
