EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering
Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie
TL;DR
EngChain is a symbolic benchmark for verifiable, multi-step engineering reasoning, targeting problems that require integrating physical laws, mathematical modeling, and practical constraints. It relies on symbolic templates for scalable problem generation and a two-stage evaluation that first verifies intermediate steps and then categorizes errors with an automated LLM-as-a-Judge. Across 11 frontier models, the study finds a persistent gap between final-answer accuracy and procedural reasoning, with Conceptual Errors prevalent and Alternative Correct solutions surprisingly frequent. The findings motivate broader multi-domain, verifiable evaluation and inform future work on dataset design, evaluation metrics, and model training for robust, principled engineering reasoning.
Abstract
Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, a trend that demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics, or code generation, none capture the integrative reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final-answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step, and then introduce LLM-as-a-Judge, an automated system that qualitatively categorizes the identified reasoning errors.
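
To make the idea of a symbolic template with step-level verification concrete, the following is a minimal sketch and not EngChain's actual implementation. The template (axial stress in a rod), the parameter ranges, the names `Step`, `axial_stress_template`, and `verify_steps`, and the 2% relative tolerance are all assumptions chosen for illustration. It covers only the numerical part of the first evaluation stage; semantic validity and the LLM-as-a-Judge stage are out of scope here.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Step:
    name: str     # label for the intermediate quantity
    value: float  # reference numeric value computed by the template


def axial_stress_template(rng: random.Random):
    """Hypothetical template: axial stress in a circular rod under tension."""
    # Randomized parameters (ranges are illustrative assumptions).
    force_n = round(rng.uniform(1_000.0, 50_000.0), 1)   # applied load, N
    diameter_m = round(rng.uniform(0.005, 0.05), 4)      # rod diameter, m

    # Reference solution, one entry per reasoning step.
    area_m2 = math.pi * diameter_m ** 2 / 4.0            # step 1: area
    stress_pa = force_n / area_m2                        # step 2: stress

    question = (
        f"A rod of diameter {diameter_m} m carries a tensile load of "
        f"{force_n} N. Find the axial stress in Pa."
    )
    steps = [Step("area_m2", area_m2), Step("stress_pa", stress_pa)]
    return question, steps


def verify_steps(reference, predicted, rel_tol=0.02):
    """Compare each predicted intermediate value with its reference value.

    `predicted` maps step names to numbers extracted from a model's answer.
    A missing step counts as a failure.
    """
    results = {}
    for step in reference:
        value = predicted.get(step.name)
        results[step.name] = value is not None and math.isclose(
            value, step.value, rel_tol=rel_tol
        )
    return results


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the instance is reproducible
    question, steps = axial_stress_template(rng)
    print(question)
    # Simulate a response with a correct area but a doubled stress value.
    predicted = {"area_m2": steps[0].value, "stress_pa": steps[1].value * 2}
    print(verify_steps(steps, predicted))
```

Because every instance is drawn from a seeded random generator, a new seed yields a fresh problem with a known step-by-step reference, which is the property that allows intermediate steps to be checked rather than only the final answer.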
