Beyond Perplexity: Let the Reader Select Retrieval Summaries via Spectrum Projection Score

Zhanghao Hu, Qinglin Zhu, Siya Qi, Yulan He, Hanqi Yan, Lin Gui

TL;DR

This work tackles the challenge of evaluating and improving retrieval-augmented generation by shifting from perplexity-based scoring to a representation-space alignment metric. It introduces Spectrum Projection Score (SPS), a training-free measure that quantifies how well a retrieved summary aligns with a reader model’s principal subspace by projecting a max-pooled bounder vector onto the reader’s PCA space, and pairs it with xCompress, an inference-time controller that selects and compresses retrieval summaries via SPS-guided sampling and adaptive filtering. Across five QA benchmarks and four open-source LLMs, SPS shows stronger correlation with downstream QA performance than traditional perplexity metrics and, when used in xCompress, yields notable gains in EM and F1 scores, highlighting the importance of semantic alignment over surface probability. The findings provide a principled, model-agnostic framework to diagnose and enhance retrieval–reader interactions in RAG systems, with practical implications for efficiently leveraging external knowledge in large language models.

Abstract

Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, which makes it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We move beyond perplexity and introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that lets the reader gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by the summary's generated tokens with the principal directions of the reader's subspace, and thereby measure relevance. Building on SPS, we present xCompress, an inference-time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open-source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
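The scoring idea described above (max-pool the summary's token embeddings, then project onto the reader's principal subspace) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the exact SPS formula is not reproduced in this section, so `projection_score` uses the fraction of the pooled vector's norm captured by the subspace as a stand-in. The function names, the `reader_matrix` input, and the way the subspace is estimated are all assumptions; only max pooling and the 0.95 retained-variance ratio come from the text.

```python
import numpy as np


def principal_basis(reader_matrix: np.ndarray, variance_ratio: float = 0.95) -> np.ndarray:
    """Return an orthonormal basis (k x d) of top principal directions.

    reader_matrix is a hypothetical (n, d) matrix taken from the reader model.
    k is the smallest number of components whose cumulative explained
    variance reaches variance_ratio.
    """
    centered = reader_matrix - reader_matrix.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, variance_ratio)) + 1
    return vt[: min(k, vt.shape[0])]


def max_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Element-wise maximum over the token axis: (m, d) -> (d,)."""
    return token_embeddings.max(axis=0)


def projection_score(token_embeddings: np.ndarray, basis: np.ndarray) -> float:
    """Stand-in score (assumption): share of the pooled vector's norm
    that lies inside the principal subspace spanned by the rows of basis."""
    pooled = max_pool(token_embeddings)
    projected = basis @ pooled  # coordinates in the principal subspace
    return float(np.linalg.norm(projected) / (np.linalg.norm(pooled) + 1e-12))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reader = rng.normal(size=(50, 8))    # toy stand-in for reader weights
    summary = rng.normal(size=(5, 8))    # toy token embeddings of one summary
    basis = principal_basis(reader, variance_ratio=0.95)
    print(basis.shape, projection_score(summary, basis))
```

Because the basis rows are orthonormal, this stand-in score always lies between 0 and 1; how the actual SPS is normalised and which direction counts as better should be taken from the paper itself.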

Paper Structure

This paper contains 36 sections, 3 theorems, 14 equations, 5 figures, 4 tables.

Key Result

Theorem 1

Given a sequence $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$, for any subsequence of $\mathbf{x}$, denoted as $\mathbf{x}_{sub}$, it always holds that $\mathbf{x}_{sub} \preceq \mathbf{x}$.
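This section does not reproduce the definition of $\preceq$, so the following is only an illustrative reading. If one assumes that $\mathbf{x}_{sub} \preceq \mathbf{x}$ means the max-pooled vector of the subsequence is bounded coordinate-wise by that of the full sequence, the statement follows because a maximum over fewer elements can never exceed a maximum over a superset. The helper names `max_pool` and `bounded_by` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def max_pool(tokens: Sequence[Vector]) -> List[float]:
    """Coordinate-wise maximum over a non-empty sequence of vectors."""
    return [max(column) for column in zip(*tokens)]


def bounded_by(sub: Sequence[Vector], full: Sequence[Vector]) -> bool:
    """Assumed reading of `sub ⪯ full`: each coordinate of max_pool(sub)
    is no larger than the matching coordinate of max_pool(full)."""
    return all(a <= b for a, b in zip(max_pool(sub), max_pool(full)))


if __name__ == "__main__":
    x = [[0.1, 2.0], [1.5, -0.3], [0.7, 0.9]]
    x_sub = [x[0], x[2]]  # a subsequence of x
    print(max_pool(x_sub), max_pool(x), bounded_by(x_sub, x))
```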

Figures (5)

  • Figure 1: Token selection in the reader’s embedding space. We project the summary's token embeddings with t-SNE and compare three selections: nearest to the mean-pooled vector, highest predictive probability, and contributors to max pooling. Mean pooling and perplexity concentrate near the center and favour syntactically frequent tokens. Max pooling emphasises boundary tokens near the convex hull that carry salient semantics.
  • Figure 2: RAG task performance (measured by EM and F1) when feeding summaries with varying PPL (left) and LongPPL (right) to the reader on the HotpotQA dataset. The low Pearson correlation coefficients ($r$) indicate that both PPL and LongPPL fail to identify a good summary.
  • Figure 3: Overview of the xCompress framework. Retrieved passages are first compressed into summaries. An adaptive norm-guided filtering mechanism determines whether additional test-time sampling is necessary. If required, multiple summaries are sampled from the compressor LLM and evaluated using the Spectrum Projection Score (SPS). These summaries are first embedded via max-pooling, then projected onto the principal subspace of the reader’s parameters. The summary with the lowest SPS is selected as input to the reader; otherwise, the initial summary is used directly for answer generation.
  • Figure 4: SPS performance (a) across LLM layers and (b) under varying PCA retained-variance ratios. The best results are obtained with embeddings from the penultimate layer and a PCA variance ratio of 0.95.
  • Figure 5: Impact of the number of generated summaries on EM and F1 scores on TriviaQA. Performance saturates at five summaries, providing an optimal balance between effectiveness and computational efficiency.
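The control flow in the Figure 3 caption can be sketched as below. This is an assumption-laden outline rather than the authors' code: `needs_resampling` stands in for the adaptive norm-guided filter, `sample_summaries` for the compressor LLM, and `score` for SPS, and all three are hypothetical callables supplied by the caller. The default of five samples follows the saturation point reported in Figure 5, and picking the minimum score follows the Figure 3 caption. Whether the initial summary is also scored as a candidate is not stated here, so this sketch leaves it out.

```python
from typing import Callable, List


def select_summary(
    initial_summary: str,
    needs_resampling: Callable[[str], bool],
    sample_summaries: Callable[[int], List[str]],
    score: Callable[[str], float],
    num_samples: int = 5,
) -> str:
    """Return the summary to pass to the reader.

    If the filter says no extra sampling is needed, the initial summary is
    used directly. Otherwise candidates are sampled and the one with the
    lowest score is chosen.
    """
    if not needs_resampling(initial_summary):
        return initial_summary
    candidates = sample_summaries(num_samples)
    if not candidates:
        # Fall back to the initial summary if the sampler returns nothing.
        return initial_summary
    return min(candidates, key=score)


if __name__ == "__main__":
    toy_scores = {"a": 0.9, "b": 0.2, "c": 0.5}  # made-up scores for the demo
    chosen = select_summary(
        "init",
        needs_resampling=lambda s: True,
        sample_summaries=lambda n: ["a", "b", "c"][:n],
        score=lambda s: toy_scores[s],
    )
    print(chosen)
```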

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof