Retrieval-augmented Decoding for Improving Truthfulness in Open-ended Generation

Manh Nguyen, Sunil Gupta, Hung Le

Abstract

Ensuring truthfulness in large language models (LLMs) remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require substantial annotated data and computational resources, which limits scalability. In contrast, decoding-time interventions offer lightweight alternatives that need no model retraining. However, existing decoding strategies often suffer from prompt sensitivity, limited generalization, or dependence on internal model states. We propose Retrieval-Augmented Decoding (RAD), a context-aware adaptive decoding method that leverages a compact reference grounding space to enable retrieval-based logit shaping during inference. The grounding space is built from as few as 10 annotated examples and comprises pairs of context embeddings and next-token logits from truthful responses. At each decoding step, RAD retrieves high-quality, semantically similar contexts from this grounding space and aggregates their associated next-token logits to modify the model's current logits. Across four open-ended generation benchmarks and four LLMs, our method consistently outperforms strong baselines and shows robust cross-task generalization, underscoring the promise of context-aware decoding for enhancing factual reliability.

Paper Structure

This paper contains 36 sections, 9 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Retrieval-Augmented Decoding (RAD). The example in this figure is from the WikiQA dataset \cite{yang2015wikiqa}. The original response (Greedy Decoding) contains incorrect information. In contrast, RAD generates a more relevant response by refining its next-token predictions based on logit signals retrieved from a precomputed grounding space. More details are provided in Figure \ref{fig:RAD_method} below.
  • Figure 2: Illustration of RAD and Greedy decoding at decoding step 2 (after generating the first token in the answer, " The"). At each decoding step, the most recent chunk of $M$ tokens ($M=8$ in this figure) is used to query the precomputed grounding space $C$. (a) Relevant context–logit pairs are retrieved from $C$ based on cosine similarity between the current chunk and all stored contexts (each also $M$ tokens long), followed by threshold filtering using $\tau$. Yellow boxes mark contexts that are retrieved as relevant; gray boxes mark those that are not. (b) Retrieved logits are aggregated via cosine-weighted averaging. (c) The aggregated logits are fused with the base (Greedy) logits using interpolation weight $\alpha$. RAD adjusts the logit distribution precisely in uncertain regions, where several candidates (e.g., "_Yet", "_co", "_ok", where "_" denotes a space) have similar scores, nudging the model toward the correct continuation (token "_co"). This single divergence leads to drastically different final answers: Greedy chooses "_Yet" $\to$ "The Yeti" (incorrect), whereas RAD selects "_co" $\to$ "The coelacanth" (correct). Additional qualitative case studies appear in Appendix \ref{app:qualitative}.
  • Figure 3: Performance on TruthfulQA across grounding space sizes $|C|$.
  • Figure 4: TruthfulQA performance on Qwen2.5-7B for chunk size $M$ (a), retrieval threshold $\tau$ (b), and interpolation weight $\alpha$ (c). $M{=}\textit{full}$ uses all preceding tokens for context chunks, while $\tau{=}1$ or $\alpha{=}0$ corresponds to the Greedy baseline (RAD without retrieved contexts). Extended results are provided in Appendix \ref{app:params-extend}.
  • Figure 5: RAD's retrieval and construction times across datasets on Qwen2.5-7B. Additional statistics of the grounding space are provided in Table \ref{tab:space-stats}, Appendix \ref{app:space-stats}.
  • ...and 3 more figures
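
The three stages described in the Figure 2 caption (threshold-filtered retrieval, cosine-weighted aggregation, and fusion with the base logits) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `rad_step`, the convex-combination form of the fusion `(1 - alpha) * base + alpha * aggregated`, the `>=` comparison against $\tau$, and all numeric values are hypothetical. The fusion form is chosen only because it reduces to the base logits at $\alpha = 0$, as the Figure 4 caption states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rad_step(query_emb, base_logits, grounding_space, tau, alpha):
    """One decoding step of a RAD-style logit adjustment (illustrative sketch).

    grounding_space: list of (context_embedding, next_token_logits) pairs.
    tau:   retrieval threshold on cosine similarity (assumed positive).
    alpha: interpolation weight; the fusion formula below is an assumption.
    """
    # (a) Retrieve stored contexts whose similarity to the query passes the threshold.
    hits = []
    for ctx_emb, ctx_logits in grounding_space:
        sim = cosine(query_emb, ctx_emb)
        if sim >= tau:
            hits.append((sim, ctx_logits))

    total = sum(sim for sim, _ in hits)
    if not hits or total <= 0.0:
        # Nothing usable retrieved: fall back to the unmodified base logits.
        return list(base_logits)

    # (b) Cosine-weighted average of the retrieved next-token logits.
    aggregated = [
        sum(sim * logits[i] for sim, logits in hits) / total
        for i in range(len(base_logits))
    ]

    # (c) Fuse with the base logits (assumed convex combination).
    return [(1 - alpha) * b + alpha * a for b, a in zip(base_logits, aggregated)]


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


# Toy values only; they are not taken from the paper.
vocab = ["_Yet", "_co", "_ok"]
base = [2.0, 1.9, 1.8]                      # base model slightly prefers "_Yet"
space = [
    ([1.0, 0.0], [0.0, 4.0, 0.0]),          # similar context, favours "_co"
    ([0.0, 1.0], [5.0, 0.0, 0.0]),          # dissimilar context, filtered out by tau
]
query = [0.9, 0.1]

fused = rad_step(query, base, space, tau=0.8, alpha=0.5)
print(vocab[argmax(base)], vocab[argmax(fused)])  # _Yet _co
```

In this toy setup only the first stored context passes the threshold, so its logits alone shift the choice from "_Yet" to "_co". Setting `alpha=0` or `tau=1` returns the base logits unchanged, which mirrors the Greedy-baseline cases noted in the Figure 4 caption.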