Towards A Unified View of Answer Calibration for Multi-Step Reasoning

Shumin Deng, Ningyu Zhang, Nay Oo, Bryan Hooi

TL;DR

This work tackles the lack of a systematic analysis of answer calibration for multi-step reasoning by proposing a unified view that separates step-level and path-level calibration and evaluates them across multiple reasoning paths. It introduces a quantitative framework with a unified score $\mathcal{D}_i = \alpha \frac{n_i}{N} + (1 - \alpha) \frac{m_i}{M}$ to balance path-consensus and intermediate-step correctness, and defines two dominant regimes via thresholds. Through extensive experiments on five arithmetic and commonsense tasks, the study finds that integrating step- and path-level strategies often yields the best performance, with path-level calibration typically providing stronger accuracy gains while step-level calibration improves robustness to prompting quality. The findings suggest practical guidelines for tuning answer calibration (notably the hyper-parameter $\alpha$) and imply that calibration can enhance accuracy, especially in zero-shot settings, while also affecting faithfulness and informativeness. Overall, the work offers a principled, unified framework to optimize multi-step reasoning via calibrated aggregation across multiple paths and steps, with implications for deploying CoT-based systems in diverse tasks.
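The unified score can be illustrated with a minimal sketch. The summary does not spell out the symbols, so the readings used here are assumptions: $N$ is the number of sampled paths, $n_i$ the number of paths whose final answer matches that of path $i$, $M$ the number of steps per path, and $m_i$ the number of steps in path $i$ judged correct. The `Path` class, the function names, and the rule of returning the answer of the highest-scoring path are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Path:
    answer: str         # final answer extracted from the reasoning path
    correct_steps: int  # m_i: steps judged correct (the step judge is assumed)


def unified_scores(paths, max_steps, alpha):
    """Return D_i = alpha * n_i / N + (1 - alpha) * m_i / M for every path."""
    if not paths or max_steps <= 0:
        raise ValueError("need at least one path and a positive step count")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_paths = len(paths)                       # N
    votes = Counter(p.answer for p in paths)   # n_i, looked up per answer
    return [
        alpha * votes[p.answer] / n_paths
        + (1 - alpha) * p.correct_steps / max_steps
        for p in paths
    ]


def calibrated_answer(paths, max_steps, alpha):
    """Pick the answer of the highest-scoring path (assumed selection rule)."""
    scores = unified_scores(paths, max_steps, alpha)
    best = max(range(len(paths)), key=scores.__getitem__)
    return paths[best].answer
```

With three paths where two agree on "18" with one correct step each and one answers "20" with three correct steps (out of $M = 3$), setting `alpha=1.0` reduces the score to pure path consensus and selects "18", while `alpha=0.0` reduces it to pure step correctness and selects "20".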

Abstract

Large Language Models (LLMs) employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. We generally divide multi-step reasoning into two phases: path generation to generate the reasoning path(s); and answer calibration post-processing the reasoning path(s) to obtain a final answer. However, the existing literature lacks systematic analysis on different answer calibration approaches. In this paper, we summarize the taxonomy of recent answer calibration techniques and break them down into step-level and path-level strategies. We then conduct a thorough evaluation on these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Experimental results reveal that integrating the dominance of both strategies tends to derive optimal outcomes. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.

Paper Structure

This paper contains 19 sections, 14 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Illustration of answer calibration for multi-step reasoning with an LLM. Step-level and path-level answer calibration methods for multiple paths can first apply answer calibration to a single path. (The distinction between answer calibration and model calibration is elaborated in the paper's appendix.)
  • Figure 2: Accuracy under different integrated step-level and path-level answer calibration strategies as $\alpha$, the weight in the unified score $\mathcal{D}_i$, varies. Performance at the two thresholds, $\frac{1}{\frac{M(N-2)}{N} + 1}$ and $\frac{1}{\frac{1}{N} + 1}$, is marked with $\bigstar$.
  • Figure 3: Accuracy, Faithfulness (over steps), and Informativeness (over path), in %, on SVAMP and MultiArith for CoT models under different prompting. Full results for the other tasks are omitted due to space limits.
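
The two $\alpha$ thresholds quoted in the Figure 2 caption can be evaluated for concrete values with a short sketch. It assumes $N$ is the number of paths and $M$ the number of steps, as in the unified score. The function name is hypothetical, and the sketch only computes the two values; it does not state which calibration regime lies on which side of each threshold.

```python
from fractions import Fraction


def alpha_thresholds(n_paths, n_steps):
    """Return the two threshold values from the Figure 2 caption as exact fractions.

    first  = 1 / (M * (N - 2) / N + 1)
    second = 1 / (1 / N + 1)
    """
    if n_paths <= 0 or n_steps <= 0:
        raise ValueError("N and M must be positive")
    first = Fraction(1) / (Fraction(n_steps * (n_paths - 2), n_paths) + 1)
    second = Fraction(1) / (Fraction(1, n_paths) + 1)
    return first, second


# Example with assumed values N = 5 paths and M = 3 steps.
low, high = alpha_thresholds(5, 3)
print(low, high)  # prints: 5/14 5/6
```

Exact fractions are used so the thresholds can be compared against a chosen $\alpha$ without floating-point rounding.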

Theorems & Definitions (2)

  • Definition 1
  • Definition 2