Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

Darioush Kevian; Usman Syed; Xingang Guo; Aaron Havens; Geir Dullerud; Peter Seiler; Lianhui Qin; Bin Hu

TL;DR

This study benchmarks GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on undergraduate control problems using the ControlBench dataset, with expert-panel evaluations identifying Claude 3 Opus as the state of the art in this domain while revealing persistent difficulties with problems that involve visual plots. It demonstrates how self-correction prompts affect accuracy and discusses the importance of prompt design, data interpretation, and calculation reliability. A simplified variant, ControlBench-C, is introduced to enable non-control experts to perform rapid automated assessments, though it cannot fully replace expert analysis. The work outlines concrete future directions, including dataset expansion, control-oriented prompting, advanced reasoning strategies, automated evaluation workflows, and the integration of vision-language capabilities for plotting tasks. Overall, the paper marks an initial but important step toward integrating LLMs into control-engineering education and research, while identifying gaps to be addressed for reliable deployment.

Abstract

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.

Paper Structure

This paper contains 16 sections, 19 equations, 2 figures, 3 tables.

Introduction
Motivating Example: A Showcase for LLM Capabilities in Control Design
The ControlBench Dataset
Evaluations of Leading LLMs on ControlBench
Statistical Accuracy Analysis
Strengths of LLMs and Successful Examples
Analysis and Insights for Failure Modes
Discussions on Self-Correction Capabilities
Sensitivity to the Problem Statements
ControlBench-C: Facilitating Evaluations by Non-Control Experts
Conclusion and Future Work
Expansion of the problem set.
Control-oriented prompting.
Improving reasoning capabilities and tool use abilities for consistency and accuracy.
Efficient evaluation.
...and 1 more section

Figures (2)

Figure 1: Proportions (%) of seven error types (#errors / #total cases) for GPT-4 and Claude 3 Opus, including the proportion of correct answers for quick comparison
Figure 2: Bode Plot Example
