Evaluating Step-by-step Reasoning Traces: A Survey

Jinu Lee, Julia Hockenmaier

TL;DR

This survey addresses the lack of a standardized framework for evaluating step-by-step reasoning traces produced by large language models. It introduces a universal taxonomy with four criteria (factuality, validity, coherence, and utility) and surveys a spectrum of evaluators, datasets, and meta-evaluation approaches that operationalize these criteria. The authors discuss trade-offs among rule-based, intrinsic, and external evaluators, and highlight techniques such as partial-context evaluation and test-time scaling to improve assessment reliability and efficiency. They outline promising directions, including symbol-grounded evaluation and rubric-based approaches for expert tasks, to extend applicability to long, real-world reasoning traces. Overall, the work provides a structured foundation to unify evaluation practices and guide future development of credible reasoning-trace evaluators.

Abstract

Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different datasets, evaluator implementations, and recent findings, leading to promising directions for future research.

Paper Structure

This paper contains 67 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustrative example of reasoning trace evaluation.
  • Figure 2: Illustration of three popular meta-evaluation methods: meta-evaluation benchmarks, verifier-guided search, and reinforcement learning.
  • Figure 3: Illustration of the proposed categories of step-by-step reasoning evaluation criteria, i.e., factuality, validity, coherence, and utility. The left block shows an example of a query and a reasoning trace. The other four blocks show examples that fail to satisfy the respective criterion. Red filled rectangles indicate the error's location, and the outlined boxes and arrows show the cause of the error. The trace example is originally from Lightman et al. (ICLR 2024).
  • Figure 4: Evaluators introduced in the analysis section, plotted by ProcessBench performance (Zheng et al., 2024; GSM8K and MATH subsets averaged) versus total compute for evaluating a trace. While these evaluators share the same base model (Qwen-2.5-7B), they improve the base model's trace evaluation capability in different ways. Details can be found in the analysis appendix.
  • Figure 5: A Sankey diagram displaying the relationship between commonly used terminologies (left) and the proposed taxonomy (right).