
Evaluating Hydro-Science and Engineering Knowledge of Large Language Models

Shiruo Hu, Wenbo Shan, Yingjia Li, Zhiqi Wan, Xinpeng Yu, Yunjia Qi, Haotian Xia, Yang Xiao, Dingxiao Liu, Jiaru Wang, Chenxu Gong, Ruixi Zhang, Shuyue Wu, Shibo Cui, Chee Hui Lai, Wei Luo, Yubin He, Bin Xu, Jianshi Zhao

TL;DR

The paper introduces Hydro-SE Bench, the first domain-specific benchmark for evaluating LLMs in Hydro-Science and Engineering across nine subfields and three cognitive types, using 4,000 questions generated from diverse sources with expert verification. It evaluates 16 LLMs (10 commercial, 6 open-source) and finds large models achieve 0.74–0.80 accuracy while smaller models lag at 0.41–0.68, with scaling predominantly improving reasoning and calculation rather than basic knowledge or engineering tasks. Subfield performance reveals stronger results in physics-grounded areas (HRD, M, PS) and weaker performance in rapidly updating or highly specialized topics (IS, BK, ESM), underscoring the need for domain-adaptive training. The study also introduces a verbalized confidence estimation approach and a difficulty-consistency analysis to enable efficient benchmarking, and it highlights calibration challenges that must be addressed for safe, reliable Hydro-SE deployment.

Abstract

Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligent automation. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs on three aspects: basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that accuracy ranges from 0.74 to 0.80 for commercial LLMs and from 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but substantial room remains for LLMs to improve on practical engineering applications. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.

Paper Structure

This paper contains 18 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Distribution of questions in the Hydro-Science and Engineering (Hydro-SE) Benchmark across subfields. Questions in each subfield are manually classified into types A (basic conceptual knowledge), B (engineering applications), and C (reasoning and calculation). Hydro-SE Bench comprises single-choice questions (SCQs) and multi-choice questions (MCQs).
  • Figure 2: Overall performance of LLMs on Hydro-SE Bench. Accuracy represents the ratio of correctly answered questions to the total number of questions, calculated separately for type A (basic conceptual knowledge), B (engineering applications), and C (reasoning and calculation) questions in the Hydro-SE Bench. (a) The results of ten commercial LLMs. (b) The results of six small-parameter open-source LLMs. In both subfigures, models are arranged from left to right in descending order of overall accuracy.
  • Figure 3: Performance of ten commercial LLMs across different subfields of Hydro-SE Bench. Performance is measured by the ratio of correctly answered questions to the total number of questions, calculated for each subfield. A larger colored area indicates better performance. Subfields are represented by abbreviations (BK: Background Knowledge; IS: Industry Standard; HWR: Hydrology and Water Resources; GE: Geotechnical Engineering; HSE: Hydraulic Structures and Equipment; ESM: Engineering Safety and Management; HRD: Hydraulics and River Dynamics; M: Meteorology; PS: Power System). The numbers in parentheses following each abbreviation denote the average accuracy value of the ten models in that subfield.
  • Figure 4: Comparison of LLMs across different parameter scales. (a) The bar chart shows the average accuracy values of large-parameter LLMs and small-parameter LLMs, with error bars representing the standard deviations of model accuracy values. (b) The radar chart illustrates the average accuracy values of large-parameter and small-parameter LLMs across nine Hydro-SE subfields. The numbers marked along the arrows indicate the accuracy differences between large- and small-parameter models.
  • Figure 5: Distribution of LLMs’ confidence estimates for Hydro-SE Bench. The bar plot shows the distribution of LLMs’ confidence estimates, with the height of each bar giving the number of questions. The line plot shows the average accuracy for questions at each confidence level. An LLM’s confidence estimates would be well calibrated if the average accuracy increases with the confidence. The dashed black line represents the perfect calibration line, where accuracy increases monotonically with the confidence level.
  • ...and 3 more figures
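
The accuracy measure used in the captions above (correctly answered questions divided by total questions, computed per question type or per confidence level) can be illustrated with a minimal Python sketch. The record layout, the field names (`type`, `confidence`, `predicted`, `answer`), the integer confidence scale, and the exact-match rule for multi-choice questions are all assumptions made for illustration; the paper's own scoring code and confidence scale are not shown in this summary.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: correct / total}, grouping records by `key`.

    A question counts as correct only when the predicted option set
    equals the answer option set (assumed exact-match rule, which
    also covers multi-choice questions).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        if set(r["predicted"]) == set(r["answer"]):
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}


# Hypothetical graded records; not data from the paper.
records = [
    {"type": "A", "confidence": 5, "predicted": "B", "answer": "B"},
    {"type": "A", "confidence": 3, "predicted": "C", "answer": "A"},
    {"type": "B", "confidence": 5, "predicted": "DA", "answer": "AD"},
    {"type": "C", "confidence": 3, "predicted": "D", "answer": "D"},
]

by_type = accuracy_by_group(records, "type")              # per question type
by_confidence = accuracy_by_group(records, "confidence")  # per confidence level
```

Grouping by `type` gives the kind of per-type accuracy described for Figure 2, and grouping by `confidence` gives the accuracy-per-confidence-level line described for Figure 5. Under this sketch, a well-calibrated model would show accuracy rising as the confidence level rises.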