SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction

Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong

TL;DR

This work addresses the challenge of evaluating retrieval utility in retrieval-augmented generation (RAG) by introducing Semantic Perplexity (SePer), a sampling-based metric that tracks how retrieved information shifts an LLM's belief toward ground-truth answers. SePer estimates the LLM's semantic belief distribution through Monte-Carlo sampling and semantic clustering, and then computes the utility of retrieval as the change in this distribution, quantified as a semantic perplexity reduction. The paper provides theoretical grounding, validity and reliability analyses, and extensive experiments across QA and multi-hop tasks, showing strong alignment with human judgments and practical insights for RAG design (e.g., optimal numbers of retrieved items, prompt compression trade-offs, and reranker effects). Overall, SePer offers a principled, efficient, and generalizable framework to quantify retrieval utility, with meaningful implications for data curation, resource allocation, and system design in real-world RAG deployments.

Abstract

Large Language Models (LLMs) have demonstrated improved generation performance by incorporating externally retrieved knowledge, a process known as retrieval-augmented generation (RAG). Despite the potential of this approach, existing studies evaluate RAG effectiveness by 1) assessing retrieval and generation components jointly, which obscures retrieval's distinct contribution, or 2) examining retrievers using traditional metrics such as NDCG, which creates a gap in understanding retrieval's true utility in the overall generation process. To address the above limitations, in this work, we introduce an automatic evaluation method that measures retrieval quality through the lens of information gain within the RAG framework. Specifically, we propose Semantic Perplexity (SePer), a metric that captures the LLM's internal belief about the correctness of the retrieved information. We quantify the utility of retrieval by the extent to which it reduces semantic perplexity post-retrieval. Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios.
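The belief-estimation step described above (sample several answers, then group them by meaning) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_equivalent` is a hypothetical stand-in that compares normalized strings, whereas a real setup would use a proper semantic-equivalence judge, and the function names and the greedy clustering strategy are assumptions made for this example.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Toy normalizer: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a semantic-equivalence judge (assumption)."""
    return normalize(a) == normalize(b)


def estimate_belief(
    samples: List[str],
    equivalent: Callable[[str, str], bool] = toy_equivalent,
) -> List[Tuple[str, float]]:
    """Greedily cluster sampled answers and return (representative, probability).

    Each sample joins the first cluster whose representative it matches;
    otherwise it starts a new cluster. A cluster's probability is its share
    of the samples, i.e. a Monte-Carlo estimate of the model's belief.
    """
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            if equivalent(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    total = len(samples)
    return [(cluster[0], len(cluster) / total) for cluster in clusters]


# Five hypothetical sampled answers to "What is the capital of France?"
samples = ["Paris", "paris.", "Lyon", "Paris", "Marseille"]
belief = estimate_belief(samples)
for answer, prob in belief:
    print(f"{answer}: {prob:.2f}")
```

In this toy run the three spellings of "Paris" fall into one cluster, so the estimated belief concentrates on that meaning rather than on any single surface string.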

Paper Structure

This paper contains 40 sections, 15 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: An illustration of retrieval utility in the multi-step RAG process. Unlike previous methods that only evaluate the final retrieval outcome, our approach can assess the utility of intermediate retrieval steps, even when the information retrieved is incomplete.
  • Figure 2: SePer: Estimating retrieval utility in multi-step retrieval-augmented generation (RAG) processes by measuring changes in model belief. SePer consists of four key steps: (1) probing the model’s belief through Monte-Carlo sampling, where the LM generates $N$ responses to the query using a temperature parameter; (2) estimating the belief distribution over possible answers using semantic clustering; (3) calculating the model’s semantic perplexity by comparing the estimated belief distribution with the ground-truth distribution; and (4) assessing the utility of partial retrieval by measuring the change in semantic perplexity before and after retrieval.
  • Figure 3: Influence of the number of samples and repeated calculation of SePer on four datasets.
  • Figure 4: Results of applying SePer to different RAG settings. The two shaded areas represent the positive and negative differences between SePer for generation w/ and w/o retrieval, respectively. The solid blue line indicates $\Delta$SePer, i.e., the utility of retrieval. The red dashed line indicates the zero point of the differences.
  • Figure 5: Differences in SePer across various retrieval and generation settings. Panel (a) illustrates the differences in SePer between generations w/ and w/o retrieval for different numbers of retrieved items. Panel (b) highlights the effect of prompt compression methods on SePer differences compared to generation w/o retrieval. Panel (c) examines the impact of the reranker on SePer differences relative to generation w/o retrieval.
  • ...and 3 more figures
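
The last two steps in the Figure 2 caption (scoring the belief distribution against the ground truth, then comparing before and after retrieval) can be illustrated with a small sketch. The formula here is an assumption, not taken from the paper: perplexity is modeled as the reciprocal of the belief mass on the ground-truth cluster (the exponential of the cross-entropy against a one-hot target), and utility is the drop in log-perplexity. The function names, the `eps` floor, and the example numbers are all hypothetical.

```python
import math
from typing import Dict


def semantic_perplexity(belief: Dict[str, float], truth: str, eps: float = 1e-6) -> float:
    """Assumed definition: 1 / (belief mass on the ground-truth cluster).

    `belief` maps a cluster label to its estimated probability. `eps` floors
    the mass so a cluster that was never sampled gives a large, finite value.
    """
    mass = max(belief.get(truth, 0.0), eps)
    return 1.0 / mass


def retrieval_utility(before: Dict[str, float], after: Dict[str, float], truth: str) -> float:
    """Reduction in log semantic perplexity caused by retrieval (assumed form).

    Positive values mean retrieval moved belief toward the ground truth;
    negative values mean it moved belief away.
    """
    return math.log(semantic_perplexity(before, truth)) - math.log(
        semantic_perplexity(after, truth)
    )


# Hypothetical belief distributions for one question, without and with retrieval.
before = {"Paris": 0.25, "Lyon": 0.5, "Marseille": 0.25}
after = {"Paris": 0.5, "Lyon": 0.5}

gain = retrieval_utility(before, after, "Paris")
print(f"utility: {gain:.3f}")
```

Because the score depends only on the belief shift, a retrieval step that is incomplete but still moves mass toward the correct answer receives a positive value, which matches the intermediate-step motivation given in the Figure 1 caption.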

Theorems & Definitions (5)

  • Definition 1: Retrieval Utility
  • Proof: Proof of Property \ref{prop:dependence}
  • Proof: Proof of Property \ref{prop:zero_point}
  • Proof: Proof of Property \ref{prop:monotonicity}
  • Definition 2: Semantic Equivalence