SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

Abstract

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference time.

Paper Structure

This paper contains 34 sections, 23 figures, and 9 tables.

Figures (23)

  • Figure 1: Diagram of SAFE at inference-time. The Generator produces or refines an atomic reasoning step, while the Evaluator detects errors and provides feedback. This step-level feedback loop continues until the reasoning terminates.
  • Figure 2: Cost-accuracy comparison between SAFE and the Self-Feedback baseline. We evaluate inference efficiency across different max_steps ($K\in\{7, 10, 13\}$) and max_retries ($N\in\{1,\dots,5\}$). While Self-Feedback exhibits a severe cost explosion without meaningful accuracy improvements, SAFE achieves superior accuracy with minimal computational overhead.
  • Figure 3: Distribution of Error Types: Relative proportion of reasoning errors categorized by the SAFE taxonomy. Procedural errors (e.g., Redundancy and Off-topic) account for the majority (64.1%) of failures, highlighting that path-finding is the primary bottleneck in multi-hop reasoning.
  • Figure 4: Step-wise Error Distribution: Frequency of error occurrences across reasoning steps. A significant concentration of failures (59.0%) is observed within the initial three steps, justifying the necessity of early-stage, inference-time verification to prevent error propagation.
  • Figure 5: Overview of our KG-grounded verification pipeline, illustrated with a real example from the benchmark. In this example, the temperature fact originates from a passage about North Carolina's Piedmont, not Virginia's, revealing a missing connection in the reasoning path. Our pipeline identifies such gaps and filters out the question accordingly.
  • ...and 18 more figures
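The inference-time loop sketched in Figures 1 and 2 (a Generator proposes atomic steps, an Evaluator accepts them or returns feedback, bounded by max_steps $K$ and max_retries $N$) can be illustrated as follows. This is a hypothetical sketch, not the paper's implementation: the function names `safe_inference`, `generate`, and `evaluate`, and the `"ANSWER:"` termination convention, are assumptions for illustration.

```python
# Hypothetical sketch of SAFE's step-level feedback loop (cf. Figures 1-2).
# `generate` and `evaluate` stand in for the paper's Generator and Evaluator
# models; `evaluate` returns (is_grounded, feedback). All names are illustrative.
def safe_inference(question, generate, evaluate, max_steps=10, max_retries=3):
    trajectory = []  # accepted atomic reasoning steps
    for _ in range(max_steps):  # bound K on reasoning length
        feedback = None
        for _ in range(max_retries + 1):  # bound N on per-step refinement
            step = generate(question, trajectory, feedback)
            ok, feedback = evaluate(question, trajectory, step)
            if ok:  # step verified as grounded; commit it
                trajectory.append(step)
                break
        # assumed termination convention: a final step prefixed "ANSWER:"
        if trajectory and trajectory[-1].startswith("ANSWER:"):
            return trajectory
    return trajectory
```

Because each step is either accepted or refined at most $N$ times before the loop moves on, total cost is bounded by $K \times (N+1)$ Generator/Evaluator calls, which is the budget varied in Figure 2.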