An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

Alexander Murphy; Mohd Sanad Zaki Rizvi; Aden Haussmann; Ping Nie; Guifu Liu; Aryo Pradipta Gema; Pasquale Minervini

TL;DR

This work addresses hallucination in LLMs for multi-hop open-domain QA by systematically analyzing how the ReAct agentic framework interacts with faithfulness-oriented decoding strategies (CAD, DoLa, DeCoRe). The study finds that combining ReAct with faithful decoding generally improves F1 and Answer Support Recall on HotpotQA, 2WikiMultihopQA, and MuSiQue, though the gains are model-dependent and no single decoder is best in every setting. The improvements depend on access to relevant retrieved context, and the methods are limited when that context is absent. The study also finds that format adherence can be improved substantially, even in training-free settings. Overall, the results offer practical guidance for building faithful, knowledge-grounded LLM agents without fine-tuning, and they show both the promise and the limits of faithful decoding in RAG-based reasoning.

Abstract

Large Language Models (LLMs) frequently produce factually inaccurate outputs, a phenomenon known as hallucination, which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to the retrieved information. Mitigating this failure is critical, especially when LLMs must reason over the retrieved information. Recent research has explored training-free decoding strategies that improve the faithfulness of model generations. We present a systematic analysis of how combining the ReAct framework with such decoding strategies (DeCoRe, DoLa, and CAD) influences the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on downstream multi-hop question answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
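
To make the idea of a training-free, faithfulness-oriented decoding strategy concrete, the sketch below illustrates one common formulation of context-aware decoding (CAD): the next-token logits computed with the retrieved context are contrasted against the logits computed without it, so tokens supported by the context are boosted. This is a minimal illustration and not the paper's implementation. The function names, the toy vocabulary, the logit values, and the choice of `alpha` are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_logits(with_context, without_context, alpha=0.5):
    """Contrast logits conditioned on context against logits without it.

    Assumed formulation: (1 + alpha) * logit_with - alpha * logit_without.
    alpha = 0 recovers ordinary decoding with the context in the prompt.
    """
    if len(with_context) != len(without_context):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [
        (1 + alpha) * wc - alpha * nc
        for wc, nc in zip(with_context, without_context)
    ]


# Hypothetical three-token vocabulary and made-up logits for one decoding step.
vocab = ["Paris", "London", "Rome"]
with_ctx = [2.0, 1.0, 0.0]     # model sees the retrieved passage
without_ctx = [1.0, 2.0, 0.0]  # model relies on parametric memory only

adjusted = context_aware_logits(with_ctx, without_ctx, alpha=1.0)
probs = softmax(adjusted)
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
```

In this toy step the context-free model prefers "London", while the contrast shifts probability mass toward "Paris", the token the retrieved passage supports. In a ReAct-style agent, an adjustment of this kind would be applied at each generation step after retrieval, which is consistent with the finding that such methods depend on relevant context being available.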
