MathBode: Measuring the Stability of LLM Reasoning using Frequency Response
Charles L. Wang
TL;DR
MathBode introduces a dynamic, frequency-domain diagnostic for LLM mathematical reasoning: it drives a single parameter of a parametric problem sinusoidally and extracts the first-harmonic gain and phase of the model's answers to produce Bode-style fingerprints. Across five closed-form problem families, the method reveals consistent low-pass behavior and phase lag that static final-answer accuracy does not capture, with a symbolic baseline used to calibrate the measurements. The framework yields robust metrics (MB-Core/MB-Plus) and diagnostics ($R^2$, residuals, ACF) that quantify reasoning fidelity, consistency, and prompt sensitivity, offering a reproducible complement to traditional benchmarks. The dataset and code are open-sourced to enable broader evaluation and incremental improvements in the dynamic reasoning quality of LLMs.
Abstract
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
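
The first-harmonic fit described in the abstract can be illustrated with a minimal sketch. It is not the released MathBode code: the function names (`first_harmonic`, `gain_phase`), the linear-solve toy problem, the sample count, and the phase sign convention (negative means lag) are all assumptions made for illustration. It assumes uniform sampling over an integer number of drive cycles, in which case the least-squares fit of $y_t \approx c + a\sin(\omega t) + b\cos(\omega t)$ reduces to two projections.

```python
import math

def first_harmonic(y, cycles):
    """Amplitude and phase of the component of y at `cycles` cycles per sweep.

    Assumes uniform samples covering an integer number of cycles, so the
    constant offset drops out and sin/cos projections are orthogonal.
    """
    n = len(y)
    if not 0 < cycles < n / 2:
        raise ValueError("need 0 < cycles < n/2 for orthogonal projections")
    w = 2 * math.pi * cycles / n
    a = 2 / n * sum(v * math.sin(w * t) for t, v in enumerate(y))
    b = 2 / n * sum(v * math.cos(w * t) for t, v in enumerate(y))
    # y ~ A*sin(w*t + phi) with a = A*cos(phi), b = A*sin(phi)
    return math.hypot(a, b), math.atan2(b, a)

def gain_phase(model_out, truth, cycles):
    """Gain = amplitude ratio, phase = phase difference wrapped to [-pi, pi)."""
    amp_m, ph_m = first_harmonic(model_out, cycles)
    amp_t, ph_t = first_harmonic(truth, cycles)
    dphi = (ph_m - ph_t + math.pi) % (2 * math.pi) - math.pi
    return amp_m / amp_t, dphi

# Hypothetical linear-solve family: a*x = b, with b driven sinusoidally.
n, cycles = 64, 4
a_coef, b0, eps = 2.0, 6.0, 0.5
w = 2 * math.pi * cycles / n
truth = [(b0 + eps * math.sin(w * t)) / a_coef for t in range(n)]

# Toy stand-in for model answers: attenuated by 0.8 and lagging by 0.3 rad.
toy = [b0 / a_coef + 0.8 * (eps / a_coef) * math.sin(w * t - 0.3) for t in range(n)]

g_base, phi_base = gain_phase(truth, truth, cycles)  # symbolic baseline
g_toy, phi_toy = gain_phase(toy, truth, cycles)
print(g_base, phi_base, g_toy, phi_toy)
```

Feeding the exact solution through the same pipeline plays the role of the symbolic baseline ($G \approx 1$, $\phi \approx 0$), while the toy series recovers a gain of about 0.8 and a phase of about $-0.3$ rad, up to floating-point error. Repeating the fit at several drive frequencies would trace out one Bode-style curve.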
