LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation

Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng

TL;DR

The paper reframes retrieval-augmented generation by arguing that the usefulness of a retrieved passage is not universal but depends on the specific LLM's internal knowledge and comprehension ability. It constructs gold utilitarian passages tailored to each LLM and introduces an LLM-specific utility judgment benchmark with set-based and ranking-based evaluations across six datasets and four LLMs. Key findings show that human-annotated passages are suboptimal for many LLMs and that gold utilitarian passages do not transfer across models, while verbalized self-judgments with pseudo-answers are robust and attention-based proxies are unreliable. The work motivates personalized retrieval and utility assessment, offering guidance for designing LLM-aware RAG systems that reject all passages for queries the model already knows and identify useful passages for queries it does not.

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG's effectiveness depends on the utility of retrieved passages, i.e., their usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting LLM-specific utility in RAG research. Our findings also indicate that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partially due to the varying readability of queries and passages for LLMs, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively: they fail to reject all passages for known queries and to select truly useful ones for unknown queries.
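
To make the notion of LLM-specific utility concrete, the following is a minimal sketch, not the paper's actual construction procedure. It assumes that a query counts as "known" when the model answers it correctly without any passage, that a passage is useful for an "unknown" query when supplying it alone makes the answer correct, and that correctness is checked with a normalized substring match in the spirit of a has_answer metric. The names `normalize`, `has_answer`, `gold_utility`, and `toy_llm` are hypothetical, and `toy_llm` is a stand-in for a real model call.

```python
import re
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def has_answer(output: str, gold_answers: List[str]) -> bool:
    """True if any normalized gold answer occurs in the normalized output."""
    norm_out = normalize(output)
    return any(normalize(ans) in norm_out for ans in gold_answers if normalize(ans))


def gold_utility(
    query: str,
    passages: List[str],
    gold_answers: List[str],
    generate: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Label a query as known/unknown for one model and collect useful passages.

    Assumption: passages are tested one at a time (pointwise); the paper may
    build its gold sets differently.
    """
    # Known query: the model is already correct with no retrieved passage,
    # so the gold utility set is empty (every passage should be rejected).
    if has_answer(generate(query, []), gold_answers):
        return {"known": True, "useful": []}
    # Unknown query: keep the passages that turn a wrong answer into a right one.
    useful = [p for p in passages if has_answer(generate(query, [p]), gold_answers)]
    return {"known": False, "useful": useful}


def toy_llm(query: str, passages: List[str]) -> str:
    """Hypothetical stand-in for an LLM: one memorized fact, else echo passages."""
    memory = {"capital of france": "Paris"}
    key = normalize(query)
    if key in memory:
        return memory[key]
    return " ".join(passages) if passages else "I do not know."


passages = [
    "Dune is a 1965 novel by Frank Herbert.",
    "Dune was adapted into a film in 2021.",
]
unknown_case = gold_utility("Who wrote Dune?", passages, ["Frank Herbert"], toy_llm)
known_case = gold_utility("Capital of France?", passages, ["Paris"], toy_llm)
```

Because the labels depend on the `generate` callable, running the same queries and passages through a different model can yield a different known/unknown split and a different useful set, which is the sense in which utility is specific to each LLM.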

Paper Structure

This paper contains 32 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Left: Answer generation performance ($has\_answer$, %) of different LLMs given the same top-20 retrieval results. Right: RAG performance ($has\_answer$, %) of LLMs with gold utilitarian passages from different LLMs.
  • Figure 2: RAG performance (%) of LLMs with different gold utilitarian passages ($U$ candidate) from different LLMs.
  • Figure 3: Average number of overlapping passages between each LLM's gold utilitarian passages ($U$ candidate) and the human-annotated passages.
  • Figure 4: Top: The perplexity (PPL) of LLMs on human-annotated passages for queries whose gold utilitarian passages ($U$ candidate) are not empty. "H" and "G" denote the human-annotated passages and the gold utilitarian passages for a specific LLM, respectively. Bottom: RAG performance ($has\_answer$, %) of LLMs with human-annotated passages on different queries. "Unk" means "Unknown". The definitions of "Known" and "Unknown" are given in Section \ref{sec:evaluation}.
  • Figure 5: The prompt of pointwise and listwise self-utility judgment.
  • ...and 4 more figures