Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation

Tianyu Liu, Jirui Qi, Paul He, Arianna Bisazza, Mrinmaya Sachan, Ryan Cotterell

TL;DR

This work investigates how the order of retrieved documents in retrieval-augmented generation affects QA performance. It introduces Pointwise Mutual Information between a question and the context as an answer-agnostic gauge, showing strong corpus- and instance-level correlations with accuracy on NQ-Open and ELI5. Two practical prompt-ordering strategies are proposed: (i) selecting the permutation that maximizes PMI and (ii) a curvature-based method grounded in discrete convexity to induce a U-shaped PMI curve. Empirical results across multiple open LMs demonstrate performance gains, with efficiency advantages from avoiding LM decoding during permutation selection. The study also discusses model-tuning effects and realistic limitations, highlighting PMI-based prompt optimization as a promising direction for improving RAG systems in practice.
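
As a point of reference, the gauge can be written out under the standard definition of pointwise mutual information, with probabilities taken under the language model $p$. The symbols $\boldsymbol{q}$ (question) and $\boldsymbol{c}_{\mathcal{D}}(\pi)$ (context built from the retrieved documents $\mathcal{D}$ in order $\pi$) follow the figure captions; the exact conditioning used in the paper is an assumption here.

```latex
\[
\mathrm{PMI}\bigl(\boldsymbol{q}, \boldsymbol{c}_{\mathcal{D}}(\pi)\bigr)
  = \log \frac{p\bigl(\boldsymbol{q} \mid \boldsymbol{c}_{\mathcal{D}}(\pi)\bigr)}{p(\boldsymbol{q})}
  = \log p\bigl(\boldsymbol{q} \mid \boldsymbol{c}_{\mathcal{D}}(\pi)\bigr) - \log p(\boldsymbol{q})
\]
```

Since $\log p(\boldsymbol{q})$ does not depend on $\pi$, for a fixed question the permutation that maximizes PMI is the one that maximizes $\log p(\boldsymbol{q} \mid \boldsymbol{c}_{\mathcal{D}}(\pi))$. Under this reading, comparing permutations requires only scoring the question tokens, which is consistent with the note above that no decoding is needed during permutation selection.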

Abstract

Recent work suggests that large language models enhanced with retrieval-augmented generation are easily influenced by the order in which the retrieved documents are presented to the model when solving tasks such as question answering (QA). However, there is no method to date that exploits this phenomenon to improve generation. We fill this gap. In this study, we show that the pointwise mutual information between a context and a question is an effective gauge for language model performance. Importantly, this gauge does not depend on knowing the answer to the question a priori. Through experiments on two question-answering datasets and a variety of large language models, we find evidence for an empirical correlation between answer accuracy and pointwise mutual information. Additionally, we propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance, whose effectiveness we demonstrate through experimentation.
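
The first of the two methods, picking the document order with the highest PMI, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `pmi`, `best_permutation`, `max_perms`, and `toy_log_prob` are hypothetical, the documents are simply joined with blank lines, and the scorer is a toy stand-in for a real language model's log-probability of the question given a prefix.

```python
import itertools
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical scorer signature: log p(text | prefix) under some language model.
LogProbFn = Callable[[str, str], float]


def pmi(question: str, context: str, log_prob: LogProbFn) -> float:
    """PMI(q, c) = log p(q | c) - log p(q), with p(q) scored on an empty prefix."""
    return log_prob(question, context) - log_prob(question, "")


def best_permutation(
    question: str,
    documents: Sequence[str],
    log_prob: LogProbFn,
    max_perms: Optional[int] = None,
) -> Tuple[Tuple[int, ...], float]:
    """Return the document order with the highest PMI and that PMI value.

    Exhaustive search is factorial in the number of documents, so max_perms
    (an assumption of this sketch) caps how many orders are scored.
    """
    best_order: Tuple[int, ...] = tuple(range(len(documents)))
    best_score = float("-inf")
    all_orders = itertools.permutations(range(len(documents)))
    for order in itertools.islice(all_orders, max_perms):
        context = "\n\n".join(documents[i] for i in order)
        score = pmi(question, context, log_prob)
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score


def toy_log_prob(text: str, prefix: str) -> float:
    """Toy stand-in for an LM scorer; NOT a real language model.

    It rewards prefix words that also occur in the text, weighting matches
    near the end of the prefix more heavily, so that order matters.
    """
    target = set(text.lower().split())
    words = prefix.lower().split()
    base = -float(len(text.split()))
    bonus = sum((i + 1) / len(words) for i, w in enumerate(words) if w in target)
    return base + bonus


if __name__ == "__main__":
    docs = ["paris is the capital of france", "bananas are yellow"]
    order, score = best_permutation("capital of france", docs, toy_log_prob)
    print(order, round(score, 3))
```

In practice `toy_log_prob` would be replaced by a function that sums the token log-probabilities of the question under the chosen LM, and the search would be restricted to a sample of permutations rather than all of them.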

Paper Structure

This paper contains 46 sections, 2 theorems, 19 equations, 14 figures, 7 tables.

Key Result

Proposition 2.1

Under assumptions given in ass:assumption, we have for an answer-dependent constant $C(\boldsymbol{a}, \boldsymbol{c}_{\mathcal{D}}(\pi))$.

Figures (14)

  • Figure 1: For the same question, a permutation of documents with a higher $\mathrm{PMI}(\boldsymbol{q}, \boldsymbol{c}_{\mathcal{D}}(\pi))$ tends to lead to a better answer.
  • Figure 2: We observe that PMI and QA accuracy trace U-shaped curves, nearly in lockstep, as the position of the gold document within the context changes. The result is computed with LLaMA-3-8B.
  • Figure 3: Corpus-level correlation between $\mathrm{PMI}(\boldsymbol{q}, \boldsymbol{c}_{\mathcal{D}}(\pi))$ and answer accuracy on NQ-Open and ELI5.
  • Figure 4: QA accuracy, PMI, and log odds ratio of answer likelihood on 20 docs evaluated on LLaMA-3.1-8B and LLaMA-3.1-8B-Instruct.
  • Figure 5: When the position of the gold document changes, both the $\mathrm{PMI}(\boldsymbol{q}, \boldsymbol{c})$ and accuracy curves are U-shaped. In contrast, both curves are flat for non-gold (denoted by random) documents.
  • ...and 9 more figures

Theorems & Definitions (6)

  • Example 2.1
  • Proposition 2.1
  • proof
  • Example 3.1
  • Proposition G.1
  • proof