Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Jingtian Wu, Claire Cardie

TL;DR

This work targets hallucinations and reasoning errors in multi-hop LLM tasks by introducing Reasoning Court (RC), a framework that couples retrieval-grounded reasoning with an independent LLM judge. RC uses two concurrent agents to produce candidate trajectories, which the judge evaluates for factual grounding and logical coherence, and may synthesize a final answer when needed. Across HotpotQA, FEVER, and MuSiQue, RC outperforms strong few-shot baselines in EM and F1, with improved efficiency due to reduced LLM calls. The approach enhances interpretability and reliability of LLM-based reasoning and has potential applicability to broader reasoning tasks beyond standard benchmarks.

Abstract

While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or to synthesize a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.
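The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `reasoning_court`, and `toy_judge` are hypothetical, the agents and judge are plain callables standing in for LLM calls, and the toy judge uses a simple string-matching rule in place of a prompted LLM. The agents also run sequentially here for simplicity, whereas the paper describes them as concurrent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """One agent's output: its reasoning trace and proposed answer (hypothetical type)."""
    steps: List[str]        # interleaved "Thought:", "Action:", "Observation:" lines
    answer: Optional[str]   # None if the agent produced no answer


# Assumed interfaces: an agent maps a question to a trajectory;
# a judge maps the question plus all trajectories to a final answer.
Agent = Callable[[str], Trajectory]
Judge = Callable[[str, List[Trajectory]], str]


def reasoning_court(question: str, agents: List[Agent], judge: Judge) -> str:
    """Collect one candidate trajectory per agent, then let the judge decide."""
    candidates = [agent(question) for agent in agents]
    return judge(question, candidates)


def toy_judge(question: str, candidates: List[Trajectory]) -> str:
    """Stand-in for the LLM judge: accept the first answer that appears in
    that candidate's retrieved evidence; otherwise return a placeholder
    where a real judge would synthesize a new answer."""
    for cand in candidates:
        evidence = [s.lower() for s in cand.steps if s.startswith("Observation")]
        if cand.answer and any(cand.answer.lower() in e for e in evidence):
            return cand.answer
    return "unknown"  # placeholder for the judge's synthesized answer


# Two hard-coded agents standing in for retrieval-augmented LLM agents.
def grounded_agent(question: str) -> Trajectory:
    return Trajectory(
        steps=["Thought: look up France.",
               "Observation: Paris is the capital of France."],
        answer="Paris",
    )


def ungrounded_agent(question: str) -> Trajectory:
    return Trajectory(steps=["Thought: I will guess."], answer="Lyon")


final = reasoning_court(
    "What is the capital of France?",
    [ungrounded_agent, grounded_agent],
    toy_judge,
)
print(final)
```

In this sketch the judge sees the full trajectories rather than only the final answers, which mirrors the abstract's point that candidates are evaluated together with their associated reasoning and evidence.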

Paper Structure

This paper contains 40 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of RC, ReAct, and CoT methods in answering a HotpotQA (Yang et al., 2018) question. The reasoning and acting stages are labeled as "Thoughts" and "Actions," respectively. Evidence, containing information retrieved from Wikipedia, is presented in "Observations." The final answer provided by the agent is shown in "Final Answer." Red highlights indicate incorrect reasoning or decisions made by the LLM agent, whereas green highlights represent correct reasoning or decisions.
  • Figure 2: Impact of increasing the number of agents on EM and F1 scores across HotpotQA, FEVER, and MuSiQue. RC represents two agents with LLM temperature set to 0, while RC-3, RC-4, and RC-5 represent 3, 4, and 5 agents respectively, using an LLM temperature of 0.7 to induce diversity in reasoning paths.
  • Figure 3: Example from the ReAct framework using the Llama-3.2-11B-text-preview model on a question from the FEVER dataset.
  • Figure 4: An example in FEVER where RC correctly identifies the correct answer in the "one correct, one incorrect" scenario.
  • Figure 5: An example in FEVER where RC correctly synthesizes the correct answer in the "both incorrect or empty" scenario.
  • ...and 2 more figures