Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

Vaibhav Mavi, Shubh Jaroria, Weiqi Sun

TL;DR

The paper addresses the reliability of LLMs in multi-step reasoning by formalizing failure detection and comparing holistic with stepwise confidence estimation. Step-level scoring, especially with a regression-based confidence scorer, detects errors more reliably on GSM8K and CoQA, with corroborating results on a private clinical dataset. Stepwise evaluation often outperforms holistic evaluation, which supports finer-grained failure detection and yields a practical framework for trustworthy multi-step reasoning in LLM systems.

Abstract

Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, in which confidence scorers estimate the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
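To make the holistic-versus-stepwise comparison concrete, the following is a minimal sketch of how the two scoring modes could be compared with AUC-ROC. It is not the authors' implementation: the `min` aggregation over step confidences, the function names, and all numeric values are assumptions chosen for illustration, and the toy numbers are not results from the paper.

```python
from typing import Sequence


def stepwise_confidence(step_scores: Sequence[float]) -> float:
    """Collapse per-step confidences into one response-level confidence.

    Taking the minimum is an assumed aggregation rule (a response is only
    as trustworthy as its weakest step); the paper may aggregate differently.
    """
    return min(step_scores)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC computed as the fraction of (positive, negative) pairs in
    which the positive example has the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: each response has per-step confidences, one
# holistic confidence for the whole response, and a failure label
# (1 = the response contains an error).
responses = [
    {"steps": [0.90, 0.80, 0.95], "holistic": 0.90, "failed": 0},
    {"steps": [0.90, 0.20, 0.90], "holistic": 0.85, "failed": 1},
    {"steps": [0.70, 0.75, 0.80], "holistic": 0.60, "failed": 0},
    {"steps": [0.95, 0.90, 0.30], "holistic": 0.70, "failed": 1},
]

labels = [r["failed"] for r in responses]

# Failure detection treats "failure" as the positive class, so the
# detector score is 1 - confidence.
holistic_failure = [1.0 - r["holistic"] for r in responses]
stepwise_failure = [1.0 - stepwise_confidence(r["steps"]) for r in responses]

holistic_auc = auc_roc(labels, holistic_failure)
stepwise_auc = auc_roc(labels, stepwise_failure)
print(f"holistic AUC-ROC: {holistic_auc:.2f}")
print(f"stepwise AUC-ROC: {stepwise_auc:.2f}")
```

In this toy setup the single low-confidence step in each failed response dominates the stepwise score, while the holistic score averages it away. That is the intuition behind step-level scoring, though the sketch says nothing about how large the effect is on real data.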

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A case from GSM8K in which the agent reaches the correct answer through incorrect reasoning steps. The agent assumes the current ages of Mico and Marco to be $5$ and $15$, although the question does not state them. The agent nonetheless arrives at the correct answer because the question concerns only the sum of their ages.
  • Figure 2: An example from the private clinical data.
  • Figure :