Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty

TL;DR

MarODE is an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the two fundamental dimensions of an evaluation metric: goodness and soundness.

Abstract

Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the two fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation (ODE)-based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
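
For readers unfamiliar with the correlation measure named in the abstract, the following is a minimal sketch of Somers' D for two paired sequences. It assumes the standard asymmetric definition, (concordant - discordant) / (pairs not tied on the independent variable). The function name `somers_d`, the choice of perturbation level as the independent variable, and the sample data are all illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D of y with respect to x.

    Counts concordant and discordant pairs and divides their difference
    by the number of pairs that are not tied on x.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_on_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are excluded from the denominator
        untied_on_x += 1
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # pairs tied only on y count in the denominator but not the numerator
    if untied_on_x == 0:
        raise ValueError("all pairs are tied on x; Somers' D is undefined")
    return (concordant - discordant) / untied_on_x


# Hypothetical data: perturbation severity (higher = more degraded trace)
# and the quality score a metric assigned to each perturbed trace.
severity = [0, 0, 1, 1, 2, 2]
scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]
d_value = somers_d(severity, scores)
```

In this toy setup, a metric that is sensitive to degradation would give lower scores as severity rises, so a value of `d_value` close to -1 indicates strong agreement with the perturbation ordering. The sign convention depends on how the ordinal labels are coded.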

Paper Structure

This paper contains 49 sections, 41 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Distribution of reasoning trace quality scores across different prompting regimes. Violin plots illustrate the score distributions for five evaluation metrics (ROSCOE, LLM-as-a-Judge, Local and Global Coherence, ReCEval, and MarODE) under three prompting settings (1-, 2-, and 4-shot) used to generate reasoning traces. For a given metric, the overall shape of the distribution remains largely consistent across shot counts, suggesting relative stability of the metric with respect to prompting variation. In contrast, different metrics exhibit markedly different distributional shapes, reflecting differences in scale, sensitivity, and scoring behavior across evaluation approaches.
  • Figure 2: Comparative performance of post-hoc reasoning evaluation metrics across datasets and models, visualized as radar charts. Each subplot corresponds to a specific dataset and prompting setting (LIAR and PolitiFact under 1-, 2-, and 4-shot prompting), with axes representing the five backbone models. Colored lines denote metric-wise Somers’ $D$ correlations with human-centric perturbations for ROSCOE, Local-Global, LLM-as-a-Judge, ReCEval, and our proposed MarODE. MarODE consistently exhibits the highest correlations across datasets, shot counts, and model families, indicating stronger sensitivity to perturbations that affect reasoning quality.
  • Figure 3: An overview of the MarODE framework, which integrates three complementary components (Markovian Coherence, Directional Consistency, and Evidence Alignment) to capture a generalist notion of completeness and soundness in reasoning traces. (a) Markovian Coherence models the local flow of reasoning, where each step in a high-quality reasoning trace is meaningfully connected to its predecessor while also facilitating progression toward a successor, often with controlled informational overlap. (b) Directional Consistency enforces a crisp, one-directional progression of reasoning, ensuring that each step advances toward a conclusion rather than oscillating or regressing. (c) Redundancy: redundant steps that fail to contribute significant new information not only increase the length of the reasoning chain but also compromise its overall quality and its ability to converge on a definite conclusion. (d) Evidence Alignment grounds reasoning steps in supporting evidence, mitigating hallucinations that can otherwise lead to confident yet incorrect conclusions.
  • Figure 4: Directional consistencies across seven reasoning scenarios. Each subplot shows the evolution of belief under two update methods: a logistic ODE solved via Runge-Kutta (blue circles) and a multiplicative Bayesian-style update (orange squares). Scenarios include consistently supportive, consistently contradictory, unverifiable, flipping, oscillating, gradual drift, and extreme certainty. The figure highlights differences in stability and volatility across reasoning dynamics.
  • Figure 5: Human evaluation questionnaire used to assess the quality of reasoning traces. Annotators rate each question on a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree).
  • ...and 2 more figures
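
The two belief-update methods contrasted in the Figure 4 caption can be illustrated with a minimal sketch. The caption does not give the paper's equations, so everything concrete here is an assumption: the logistic ODE is taken as db/dt = rate * signal * b * (1 - b) and integrated with a classical fourth-order Runge-Kutta step, while the multiplicative update is a two-hypothesis Bayes rule with likelihood p = (1 + signal) / 2. The per-step `signal` in [-1, 1], the `rate` and `eps` parameters, and all function names are hypothetical.

```python
def logistic_rhs(belief, signal, rate):
    """Right-hand side of the assumed logistic ODE: db/dt = rate * signal * b * (1 - b)."""
    return rate * signal * belief * (1.0 - belief)


def rk4_step(belief, signal, rate=1.0, dt=1.0):
    """Advance the belief by one classical Runge-Kutta (RK4) step of size dt."""
    k1 = logistic_rhs(belief, signal, rate)
    k2 = logistic_rhs(belief + 0.5 * dt * k1, signal, rate)
    k3 = logistic_rhs(belief + 0.5 * dt * k2, signal, rate)
    k4 = logistic_rhs(belief + dt * k3, signal, rate)
    return belief + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def bayes_step(belief, signal, eps=1e-3):
    """Multiplicative Bayesian-style update with an assumed likelihood p = (1 + signal) / 2."""
    p = min(max(0.5 * (1.0 + signal), eps), 1.0 - eps)  # clip away from 0 and 1
    numerator = belief * p
    return numerator / (numerator + (1.0 - belief) * (1.0 - p))


def trajectory(step, signals, initial_belief=0.5):
    """Apply `step` once per signal and return the full belief path."""
    beliefs = [initial_belief]
    for signal in signals:
        beliefs.append(step(beliefs[-1], signal))
    return beliefs


# Two of the scenarios named in the caption, with made-up signal values.
supportive = [0.8] * 5
oscillating = [0.8, -0.8] * 3

ode_supportive = trajectory(rk4_step, supportive)
ode_oscillating = trajectory(rk4_step, oscillating)
bayes_oscillating = trajectory(bayes_step, oscillating)
```

Under these assumptions, the ODE path moves smoothly toward 1 on consistently supportive signals, while on oscillating signals the multiplicative update swings further from the starting belief than the ODE path does. That matches the kind of stability-versus-volatility contrast the caption describes, though the actual curves in the paper depend on its own parameterization.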