MathBode: Measuring the Stability of LLM Reasoning using Frequency Response

Charles L. Wang

TL;DR

MathBode introduces a dynamic, frequency-domain diagnostic for LLM mathematical reasoning by sinusoidally driving a parametric problem and extracting first-harmonic gain and phase to produce Bode-style fingerprints. Across five closed-form families, the method reveals consistent low-pass behavior and phase lag not captured by static final-answer accuracy, using a symbolic baseline to calibrate measurements. The framework yields robust metrics (MB-Core/MB-Plus) and diagnostics ($R^2$, residuals, ACF) that quantify reasoning fidelity, consistency, and prompt sensitivity, offering a reproducible complement to traditional benchmarks. By open-sourcing the dataset and code, the work enables broader evaluation and incremental improvements in dynamic reasoning quality for LLMs.

Abstract

This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
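
The first-harmonic fit described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names (`first_harmonic`, `gain_and_phase`), the integer time index, the least-squares formulation, and the sign convention (negative phase means the model lags the truth) are all assumptions made for illustration.

```python
import numpy as np

def first_harmonic(y, cycles):
    """Least-squares fit of y[t] ~ c + a*sin(w t) + b*cos(w t).

    `cycles` is the number of full sinusoid periods across the sweep
    (a hypothetical parameterization of the drive frequency).
    Returns (amplitude, phase) such that y ~ c + amplitude*sin(w t + phase).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)
    w = 2.0 * np.pi * cycles / n
    X = np.column_stack([np.ones(n), np.sin(w * t), np.cos(w * t)])
    (c, a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.hypot(a, b), np.arctan2(b, a)

def wrap_phase(d):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return np.pi - (np.pi - d) % (2.0 * np.pi)

def gain_and_phase(model_out, truth_out, cycles):
    """Gain = model amplitude / truth amplitude; phase = model minus truth."""
    amp_m, ph_m = first_harmonic(model_out, cycles)
    amp_t, ph_t = first_harmonic(truth_out, cycles)
    return amp_m / amp_t, wrap_phase(ph_m - ph_t)

# Synthetic example: a "model" that under-reacts (gain 0.8) and lags by 0.3 rad.
n, cycles = 64, 4
t = np.arange(n)
w = 2.0 * np.pi * cycles / n
truth = 3.0 + 2.0 * np.sin(w * t)
model = 3.0 + 1.6 * np.sin(w * t - 0.3)
gain, phase = gain_and_phase(model, truth, cycles)
```

Under these assumptions, passing the exact solution as both arguments gives $G = 1$ and $\phi = 0$, which is the role the abstract assigns to the symbolic baseline.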

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Gain vs. frequency. Panels are families; curves overlay models (unity $G{=}1$ dashed). Mid-band ({4,8}) deviations indicate under/over-reaction despite identical ground truth.
  • Figure 2: Phase error vs. frequency. Signed model–truth phase difference (rad), wrapped to $(-\pi,\pi]$; $0$ rad implies perfect timing.
  • Figure 3: Residual ACF(1) vs. frequency. Near-zero ACF(1) means little temporal structure remains after the harmonic fit; negative values align with alternating over/undershoots at higher frequencies.
  • Figure 4: First-harmonic fit quality ($R^2$) vs. frequency. High $R^2$ validates a single-sinusoid description; dips signal nonlinear distortion or prompt-surface effects.
  • Figure 5: A3. Compliance by family. Compliance is perfect overall.
  • ...and 2 more figures
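
The fit-quality diagnostics named in the captions for Figures 3 and 4 can be sketched as follows. This section does not give the paper's exact definitions, so the sketch assumes the standard ones: $R^2$ as one minus the ratio of residual to total sum of squares, and ACF(1) as the lag-1 autocorrelation of the residuals left after the first-harmonic fit. The function name `fit_diagnostics` and the `cycles` parameter are hypothetical.

```python
import numpy as np

def fit_diagnostics(y, cycles):
    """Return (r2, acf1) for a single-sinusoid least-squares fit of y.

    Assumed definitions (not taken from the paper):
      r2   = 1 - SS_res / SS_tot
      acf1 = lag-1 autocorrelation of the mean-centered residuals
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)
    w = 2.0 * np.pi * cycles / n
    X = np.column_stack([np.ones(n), np.sin(w * t), np.cos(w * t)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef

    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    r = resid - resid.mean()
    denom = np.sum(r ** 2)
    acf1 = np.sum(r[1:] * r[:-1]) / denom if denom > 0 else 0.0
    return r2, acf1

# Synthetic example: a clean sinusoid plus an alternating over/undershoot.
n, cycles = 64, 4
t = np.arange(n)
y = np.sin(2.0 * np.pi * cycles * t / n) + 0.1 * (-1.0) ** t
r2, acf1 = fit_diagnostics(y, cycles)
```

In this synthetic case the alternating error is orthogonal to the fitted harmonic, so it stays entirely in the residual: $R^2$ remains high while ACF(1) is strongly negative, matching the reading of negative values given in the Figure 3 caption. When the residual is at floating-point noise level, ACF(1) computed this way is not meaningful.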