LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
TL;DR
The paper reframes retrieval-augmented generation by arguing that the usefulness of retrieved passages is not universal but depends on the specific LLM's internal knowledge and comprehension. It constructs gold utilitarian passages tailored to each LLM and introduces an LLM-specific utility judgment benchmark with set-based and ranking-based evaluations across six datasets and four LLMs. Key findings show that human-annotated utility is suboptimal for many LLMs and that gold utilitarian passages are not transferable across models; verbalized self-judgments with pseudo-answers are robust, whereas attention proxies are unreliable. The work motivates personalized retrieval and utility assessment, offering guidance for designing LLM-aware RAG systems that reject all passages for queries the LLM already knows and identify useful passages for queries it does not.
Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG's effectiveness depends on the utility of retrieved passages, i.e., their usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting LLM-specific utility in RAG research. We further find that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partly because queries and passages vary in how readable they are to each LLM, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that, while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively: they fail to reject all passages for known queries and to select truly useful ones for unknown queries.
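As a rough illustration of two ideas mentioned above, the following Python sketch computes perplexity from per-token log-probabilities and labels passages as useful for one particular LLM. It is not the paper's procedure: the labeling rule (a passage counts as useful only if it improves the answer score over the closed-book score) is an assumption made for illustration, and the names `perplexity`, `label_utility`, `score_fn`, and `closed_book_score` are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability.

    Lower values mean the model finds the text more predictable, which is
    one way to quantify how "readable" a query or passage is to an LLM.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def label_utility(
    passages: Sequence[str],
    score_fn: Callable[[str], float],
    closed_book_score: float,
) -> Dict[str, bool]:
    """Label each passage as useful (True) or not (False) for one LLM.

    ASSUMPTION (not taken from the paper): a passage is useful if the
    LLM's answer score when given the passage strictly exceeds its score
    without any passage. `score_fn` is a hypothetical callable returning
    that with-passage score; it would wrap a real LLM call and an answer
    metric in practice.
    """
    return {p: score_fn(p) > closed_book_score for p in passages}


# Toy usage with made-up scores standing in for real LLM evaluations.
toy_scores = {"p1": 0.9, "p2": 0.2, "p3": 0.5}
labels = label_utility(list(toy_scores), toy_scores.__getitem__, 0.5)
ppl = perplexity([math.log(0.5)] * 4)
```

Under this assumed rule, the same passage can receive different labels for different LLMs, because both `score_fn` and `closed_book_score` depend on the model being evaluated. If the closed-book score is already at its maximum, no passage can be labeled useful, which matches the idea of rejecting all passages for a known query.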
