A Systematic Study of Pseudo-Relevance Feedback with LLMs

Nour Jedidi, Jimmy Lin

TL;DR

Across 13 low-resource BEIR tasks with five LLM PRF methods, the results show that the choice of feedback model can play a critical role in PRF effectiveness; feedback derived solely from LLM-generated text provides the most cost-effective solution; and feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever.

Abstract

Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impacts PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

Paper Structure (20 sections, 5 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Demonstration of feedback source and feedback model. Feedback can come from (1) an LLM, (2) the corpus, or both. Feedback documents then get passed to a feedback model to update the query. (a) Weighted average vector, (b) average vector, and (c) string concatenation with a query repeat are different feedback models used in the literature.
  • Figure 2: Overview of different PRF pipelines. Dotted boxes denote optional steps. For example, if the blue documents are not passed to UMBRELA, $d^2$ gets fed into the feedback model. Running both PRF pipelines in parallel is equivalent to Umbrela-HyDE, which concatenates both sets of feedback documents (yellow and green) before passing them to the feedback model.
  • Figure 3: Effectiveness of LLM-PRF methods across different numbers of top-$k$ candidate documents from BM25. Umbrela-HyDE and PRF-HyDE are explored in Section \ref{sec:combining_feedback_tests}.
  • Figure 4: Latency of LLM PRF methods relative to HyDE.
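
The three feedback models named in the Figure 1 caption (weighted average vector, average vector, and string concatenation with a query repeat) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the Rocchio-style weights `alpha` and `beta`, the choice to include the query vector in the plain average, and the default number of query repeats are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def average_vector(query_vec: Sequence[float],
                   feedback_vecs: Sequence[Sequence[float]]) -> Vector:
    """Unweighted mean of the query vector and all feedback vectors.

    Assumption: the query vector is included in the mean.
    """
    vectors = [list(query_vec)] + [list(v) for v in feedback_vecs]
    dim = len(query_vec)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_average_vector(query_vec: Sequence[float],
                            feedback_vecs: Sequence[Sequence[float]],
                            alpha: float = 1.0,
                            beta: float = 0.5) -> Vector:
    """Rocchio-style update: alpha * query + beta * centroid(feedback).

    Assumption: alpha and beta are hypothetical weights, not values
    taken from the paper.
    """
    dim = len(query_vec)
    if not feedback_vecs:
        # No feedback available: fall back to the scaled query vector.
        return [alpha * x for x in query_vec]
    centroid = [sum(v[i] for v in feedback_vecs) / len(feedback_vecs)
                for i in range(dim)]
    return [alpha * query_vec[i] + beta * centroid[i] for i in range(dim)]


def concat_with_query_repeat(query: str,
                             feedback_docs: Sequence[str],
                             repeats: int = 3) -> str:
    """Build an expanded query string: the query repeated `repeats` times,
    followed by the feedback documents.

    Assumption: the repeat count and the ordering are illustrative.
    """
    parts = [query] * repeats + list(feedback_docs)
    return " ".join(parts)
```

The two vector variants would feed a dense retriever, while the string variant would feed a lexical retriever such as BM25, where repeating the query keeps its terms from being swamped by the longer feedback text.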