CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan

TL;DR

CFDLLMBench presents the first holistic benchmark suite for evaluating large language models on computational fluid dynamics tasks, spanning conceptual CFD knowledge (CFDQuery), CFD code generation (CFDCodeBench), and practical OpenFOAM workflow automation (FoamBench). By coupling domain-specific datasets with carefully designed, execution-oriented metrics (executable code, NMSE, numerical convergence, and ROUGE-based structural checks), the study reveals that while state-of-the-art models perform well on conceptual questions, their capabilities in generating correct CFD code and running valid simulations remain limited without agentic workflows. The authors also demonstrate that retrieval-augmented generation and human-in-the-loop reviewing can meaningfully boost performance in complex OpenFOAM tasks, underscoring the importance of tool-use and long-context reasoning for scientific automation. Overall, CFDLLMBench provides a robust, open-source platform to drive progress in LLM-enabled scientific computing and CFD workflow automation.

Abstract

Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
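
To make the "solution accuracy" axis concrete, the sketch below shows one common way to compute a normalized mean squared error (NMSE) between a generated solution field and a reference field, and to turn per-case results into a Success Rate. It is illustrative only. The NMSE normalization (squared error divided by the squared norm of the reference), the function names `nmse` and `success_rate`, and the `threshold` value are assumptions made here, not the definitions used in CFDLLMBench, which are given in the paper's metrics section.

```python
import numpy as np


def nmse(pred, ref):
    """Normalized mean squared error: sum((pred - ref)^2) / sum(ref^2).

    Assumed normalization; the benchmark's exact definition may differ.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("prediction and reference must have the same shape")
    denom = np.sum(ref ** 2)
    if denom == 0.0:
        raise ValueError("reference field is identically zero")
    return float(np.sum((pred - ref) ** 2) / denom)


def success_rate(cases, threshold=0.05):
    """Fraction of cases that executed and fall below an NMSE threshold.

    `cases` is a list of (executed, pred, ref) tuples; `threshold` is a
    hypothetical cutoff chosen for illustration.
    """
    if not cases:
        return 0.0
    passed = sum(
        1
        for executed, pred, ref in cases
        if executed and nmse(pred, ref) <= threshold
    )
    return passed / len(cases)
```

A case whose code fails to run counts as a failure regardless of its output, which mirrors the idea that executability is a prerequisite for accuracy.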

Paper Structure

This paper contains 28 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of CFDLLMBench: As the first LLM benchmark designed to holistically evaluate LLM capabilities for CFD, it consists of three different tasks and datasets. (1) CFDQuery: Graduate-level CFD QA. (2) CFDCodeBench: Coding questions about solving common linear/nonlinear PDEs encountered in CFD. (3) FoamBench: Configuring OpenFOAM case files for simulating realistic engineering scenarios such as incompressible flow over obstacles, supersonic flow with shockwaves, Rayleigh-Bénard convection, etc.
  • Figure 2: Success Rate comparison of different models across the three tasks. Success Rate is the fraction of cases in the benchmark that produce physically accurate results (higher is better). The detailed definition of Success Rate for each benchmark task can be found in the paper's metrics section. The results for FoamBench are produced using the Foam-Agent framework with RAG, Reviewer, and Sonnet 3.5. There is a steep drop in performance from graduate-level knowledge (CFDQuery) to practical simulation workflow automation (FoamBench).
  • Figure 3: Average metric score and Success Rate for CFDCodeBench. The Success Rate for even the best-performing models is around 14%, reflecting the difficulty of the problems in this benchmark.
  • Figure 4: Average metric score and Success Rate for different models on FoamBench using the Foam-Agent framework with RAG and Reviewer. The Success Rate for even the best-performing model (Sonnet 3.5) is 34% on the basic dataset and 25% on the advanced dataset.
  • Figure 5: Comparison of the geometry and mesh generated by Foam-Agent (Yue et al., 2025), using RAG and Reviewer with Sonnet 3.5, for the doubleSquare case against those of a human expert.
  • ...and 8 more figures