EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie

TL;DR

EngChain is a symbolic benchmark for verifiable, multi-step engineering reasoning. It targets problems that require integrating physical laws, mathematical modeling, and practical constraints. Problems are generated at scale from symbolic templates, and evaluation proceeds in two stages: intermediate steps are verified first, and the identified errors are then categorized by an automated LLM-as-a-Judge. Across 11 frontier models, the study finds a persistent gap between final-answer accuracy and procedural reasoning, with Conceptual Errors prevalent and a surprising rate of Alternative Correct solutions. The findings motivate broader multi-domain, verifiable evaluation and point to improvements in dataset design, evaluation metrics, and model training for robust, principled engineering reasoning.

Abstract

Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.
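The template-and-verify idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cstr_volume_template`, `verify_steps`), the parameter ranges, and the 2% relative tolerance are all assumptions chosen for illustration. The example uses the standard design equation for a first-order, liquid-phase reaction in a CSTR, V = v0 X / (k (1 - X)), and checks only the numerical part of step verification, not the semantic part.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    value: float


def cstr_volume_template(seed):
    """Hypothetical template: sample parameters, then build a question
    and its reference chain of intermediate steps."""
    rng = random.Random(seed)
    v0 = round(rng.uniform(1.0, 10.0), 2)  # volumetric flow rate, L/min (assumed range)
    k = round(rng.uniform(0.05, 0.5), 3)   # first-order rate constant, 1/min (assumed range)
    x = round(rng.uniform(0.5, 0.95), 2)   # target conversion (assumed range)

    question = (
        f"A first-order liquid-phase reaction (k = {k} 1/min) runs in a CSTR "
        f"fed at {v0} L/min. What reactor volume gives a conversion of {x}?"
    )

    # First-order CSTR design equation: tau = X / (k (1 - X)), V = v0 * tau
    tau = x / (k * (1 - x))
    volume = v0 * tau
    steps = [
        Step("space time tau = X / (k (1 - X)), in min", tau),
        Step("reactor volume V = v0 * tau, in L", volume),
    ]
    return question, steps


def verify_steps(reference, predicted, rel_tol=0.02):
    """Return one pass/fail flag per reference step.

    The relative tolerance is an assumption, not a value from the paper.
    """
    return [
        math.isclose(p, r.value, rel_tol=rel_tol)
        for r, p in zip(reference, predicted)
    ]


if __name__ == "__main__":
    question, steps = cstr_volume_template(seed=7)
    print(question)
    for step in steps:
        print(f"  {step.description}: {step.value:.3f}")
    # A response with a correct first step and a wrong final step
    print(verify_steps(steps, [steps[0].value, steps[1].value * 1.5]))
```

Seeding the generator makes each instance reproducible, while varying the seed yields fresh parameter values. Checking each step separately makes it possible to flag a response whose intermediate reasoning is wrong even when other steps are right.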

Paper Structure

This paper contains 47 sections, 7 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: An example from EngChain (CSTR Volume Calculation). A symbolic template (1) generates a unique problem instance (2) and its verifiable, step-by-step chain-of-thought solution (3).
  • Figure 2: The taxonomy of EngChain.
  • Figure 3: Our template generation pipeline.
  • Figure 4: Generic multi-disciplinary template structure.
  • Figure 5: Automated error analysis with the LLM-as-a-Judge workflow.
  • ...and 6 more figures