An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

Alexander Murphy, Mohd Sanad Zaki Rizvi, Aden Haussmann, Ping Nie, Guifu Liu, Aryo Pradipta Gema, Pasquale Minervini

TL;DR

This work tackles hallucination in LLMs for multi-hop open-domain QA by systematically analyzing how the ReAct agentic framework interacts with decoding-time faithfulness strategies (CAD, DoLa, DeCoRe). The study finds that integrating ReAct with faithful decoding generally improves F1 and Answer Support Recall across HotpotQA, 2WikiMultihopQA, and MuSiQue, though gains are model-dependent and no single decoder dominates in all scenarios. It also finds that the improvements depend on access to relevant retrieved context, with limitations arising when context is absent, and that format adherence can be substantially improved even in training-free settings. Overall, the results offer practical guidance for building faithful, knowledge-grounded LLM agents without fine-tuning, highlighting both the promise and the boundaries of faithful decoding in RAG-based reasoning.

Abstract

Large Language Models (LLMs) frequently produce factually inaccurate outputs, a phenomenon known as hallucination, which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to retrieved information. Mitigating this is critical, especially if LLMs are required to reason about the retrieved information. Recent research has explored training-free decoding strategies to improve the faithfulness of model generations. We present a systematic analysis of how combining the ReAct framework with decoding strategies (i.e., DeCoRe, DoLa, and CAD) influences the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on downstream multi-hop question answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
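
As a rough illustration of what a training-free faithful decoding step can look like, the sketch below follows the contrastive form commonly given for context-aware decoding (CAD): next-token logits computed with the retrieved context are amplified against logits computed without it. The function names, the toy logits, and the value of `alpha` are assumptions made for illustration and are not taken from this paper. DoLa and DeCoRe rely on different contrast signals and are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cad_adjust(logits_with_context, logits_without_context, alpha=0.5):
    """Hypothetical helper: contrast context-conditioned logits against
    context-free logits as (1 + alpha) * with - alpha * without."""
    return [
        (1 + alpha) * with_c - alpha * without_c
        for with_c, without_c in zip(logits_with_context, logits_without_context)
    ]


# Toy three-token vocabulary (made-up numbers, not model outputs).
with_context = [2.0, 1.0, 0.0]      # logits when the retrieved passage is in the prompt
without_context = [0.0, 1.0, 2.0]   # logits from the question alone
adjusted = cad_adjust(with_context, without_context, alpha=0.5)
probs = softmax(adjusted)
```

With `alpha=0` the adjustment reduces to ordinary decoding on the context-conditioned logits; larger values push probability toward tokens that the context, rather than the model's parametric prior, supports.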

Paper Structure

This paper contains 28 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: An example showing the difference between ReAct using standard vs. faithful decoding. The Thoughts, Actions and Observations depend on previous steps: in the standard decoder, a mistake in Thought 2 propagates throughout the reasoning chain, yielding a wrong answer. The faithful decoder stays faithful to the observation, yielding the correct answer.
  • Figure 2: The proportion of questions for which a complete reasoning trace was created with ReAct (measured by how many have the Finish[...] keyword in their final output).
  • Figure 3: An example from 2WikiMultihopQA where faithful decoding improves upon standard decoding with Qwen2-7b-Instruct. Standard decoding is not faithful to the retrieved context and generates an incorrect thought, derailing the reasoning chain. CAD helps the agent stay faithful to context and leads it to the correct solution.
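
The completion measure described for Figure 2 can be sketched as a simple keyword check. This is a minimal sketch, assuming each final output is available as a plain string; the regular expression, the function names, and the sample traces are hypothetical and do not come from the paper's code.

```python
import re

# Assumed pattern: a trace counts as complete if it contains Finish[<anything>].
FINISH_PATTERN = re.compile(r"Finish\[(.*?)\]", re.DOTALL)


def has_complete_trace(final_output: str) -> bool:
    """Return True if the output contains a Finish[...] action."""
    return FINISH_PATTERN.search(final_output) is not None


def completion_rate(final_outputs) -> float:
    """Proportion of outputs that contain a Finish[...] action."""
    if not final_outputs:
        return 0.0
    completed = sum(has_complete_trace(output) for output in final_outputs)
    return completed / len(final_outputs)


# Made-up example traces for illustration only.
outputs = [
    "Thought 3: I have the answer.\nAction 3: Finish[Paris]",
    "Thought 2: I need to search more.\nAction 2: Search[Eiffel Tower]",
]
rate = completion_rate(outputs)
```

A check like this only measures format adherence, that is, whether the agent terminated with a final answer action; it says nothing about whether that answer is correct.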