Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective

Nicy Scaria, Silvester John Joseph Kennedy, Krishna Agarwal, Diksha Seth, Deepak Subramani

TL;DR

The paper targets the problem that final-answer accuracy in physics questions by small language models can hide flawed reasoning processes, which risks reinforcing student misconceptions. It introduces PhysBench, a large, structured OpenStax-based physics benchmark with Bloom's taxonomy annotations and 2,700 culturally contextualized variants, evaluated via a stage-wise rubric called P-REFS. Using 10 sub-4B SLMs and an LLM judge across approximately 58,000 responses, the study finds a substantial reliability gap: among final-answer-correct solutions, 75–98% contain at least one reasoning error, with weaker models failing early in interpretation and modeling, and stronger models failing during execution and validation. Contextual rewrites have minimal impact on top models but degrade mid-tier models, underscoring that safe educational AI requires diagnostics that prioritize reasoning fidelity over final correctness and that process-aware interventions are needed for deployment in classroom settings.

Abstract

Small Language Models (SLMs) offer privacy and efficiency for educational deployment, yet their utility depends on reliable multistep reasoning. Existing benchmarks often prioritize final-answer accuracy, obscuring 'right answer, wrong procedure' failures that can reinforce student misconceptions. This work investigates SLM physics reasoning reliability, stage-wise failure modes, and robustness under paired contextual variants. We introduce PhysBench, comprising 3,162 high school and AP level physics questions derived from OpenStax in a structured reference solution format with Bloom's Taxonomy annotations, plus 2,700 paired culturally contextualized variants. Using P-REFS, a stage-wise evaluation rubric, we assess 10 SLMs across 58,000 responses. Results reveal a substantial reliability gap: among final-answer-correct solutions, 75 to 98% contain at least one reasoning error. Failure modes shift with model capability: weaker models fail primarily at interpretation or modeling, while stronger models often fail during execution. Paired contextual variations have minimal impact on top models but degrade the performance of mid-tier models. These findings demonstrate that safe educational AI requires evaluation paradigms that prioritize reasoning fidelity over final-answer correctness.
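
The reliability gap described in the abstract can be read as a conditional rate: of the responses whose final answer is correct, what share still contains at least one reasoning error. The sketch below illustrates that arithmetic only. The `Response` record, its field names, and the toy data are hypothetical and are not taken from the paper or from P-REFS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    # Hypothetical record: one graded model response.
    final_answer_correct: bool
    stage_errors: int  # number of rubric stages flagged with a reasoning error


def reliability_gap(responses: List[Response]) -> Optional[float]:
    """Share of final-answer-correct responses with at least one reasoning error.

    Returns None when no response has a correct final answer, since the
    rate is undefined in that case.
    """
    correct = [r for r in responses if r.final_answer_correct]
    if not correct:
        return None
    flawed = sum(1 for r in correct if r.stage_errors > 0)
    return flawed / len(correct)


# Toy data (made up for illustration): three correct answers, two of which
# contain reasoning errors, plus one incorrect answer that is excluded.
toy = [
    Response(True, 0),
    Response(True, 2),
    Response(True, 1),
    Response(False, 3),
]
gap = reliability_gap(toy)
print(f"Reliability gap: {gap:.0%}")
```

Responses with an incorrect final answer are excluded from the denominator, which is why this rate can be high even for a model whose final-answer accuracy looks strong.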

Paper Structure

This paper contains 75 sections, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Multi-dimensional physics reasoning evaluation pipeline using OpenStax problems with structured answers and P-REFS (10-point rubric) for analyzing performance, errors, and contextualization effects.
  • Figure 2: Overview of the two-stage process for creating culturally contextualized physics benchmarks.
  • Figure 3: Part-wise performance breakdown for PhysBench. The height represents the average overall score (%).
  • Figure 4: Final answer vs. full correctness for conceptual and problem-solving questions. Gap percentages show correct answers with reasoning errors.
  • Figure 5: Performance across cultural contexts on PhysBench$_{\text{Contextual}}$ (900-question baseline subset), PhysBench$_{\text{Asia}}$, PhysBench$_{\text{Africa}}$, and PhysBench$_{\text{OcSA}}$.
  • ...and 13 more figures