SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

TL;DR

SymPyBench delivers a dynamic, parameterized physics benchmark with 15,045 problems and executable Python ground-truths to probe robust scientific reasoning. It introduces a structured data pipeline (problem extraction, structured representation, template generation) and three novel evaluation metrics (Consistency Score, Failure Rate, Confusion Rate) to assess generalization under variation. The work demonstrates diverse question formats (free-form, MC-Symbolic, MC-Numerical) and analyzes state-of-the-art instruction-tuned models across these axes, revealing strengths in symbolic reasoning and gaps in numerical execution and formatting. This benchmark offers a foundation for building more reliable and interpretable reasoning systems in scientific domains.

Abstract

We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
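
To make the idea of a parameterized problem with an executable ground truth concrete, the following is a minimal sketch and not code from the benchmark itself. The problem (projectile range on level ground), the template text, and the names `ground_truth`, `TEMPLATE`, and `sample_variant` are all hypothetical, chosen only for illustration; the benchmark's own templates and code format may differ.

```python
import math
import random

# Hypothetical question template; the placeholders are the problem's parameters.
TEMPLATE = (
    "A projectile is launched at {v0} m/s at an angle of {theta_deg} degrees "
    "above the horizontal. Find its range on level ground."
)


def ground_truth(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Range of a drag-free projectile on level ground: v0^2 * sin(2*theta) / g."""
    theta = math.radians(theta_deg)
    return v0 ** 2 * math.sin(2 * theta) / g


def sample_variant(rng: random.Random) -> dict:
    """Draw one parameter set, render the question, and compute its answer."""
    params = {"v0": rng.randint(5, 50), "theta_deg": rng.randint(10, 80)}
    return {
        "question": TEMPLATE.format(**params),
        "params": params,
        "answer": ground_truth(**params),
    }


# Three variants of the same underlying problem (illustrative seed).
rng = random.Random(0)
variants = [sample_variant(rng) for _ in range(3)]
for v in variants:
    print(v["question"], "->", round(v["answer"], 2), "m")
```

Because the answer is computed by code rather than stored, any number of variants of one problem can be generated and a model's responses compared across them.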

Paper Structure

This paper contains 34 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: An example from the SymPyBench dataset illustrating a free-form physics question. The figure shows a parameterized problem with variable input parameters, the final answer, detailed step-by-step reasoning, and the associated executable Python code. The question includes metadata such as domain, subdomain, and difficulty.
  • Figure 2: High-level pipeline diagram summarizing the workflow of creating SymPyBench.
  • Figure 3: Distribution of SymPyBench problems by number of sub-questions per instance.
  • Figure 4: Three variations of the same question with different input variables. Only the Qwen-7B model's final-step responses are shown.