Leveraging LLMs for Unsupervised Dense Retriever Ranking

Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Guido Zuccon

TL;DR

The paper tackles the problem of selecting and ranking dense retrievers for a target corpus under domain shift without access to target queries or relevance labels. It introduces LARMOR, which uses large language models to generate pseudo-queries, pseudo-relevance judgments, and pseudo-reference lists from a sampled portion of the target corpus, then fuses signals to rank a pool of dense retrievers. Evaluated across 13 BEIR corpora with a pool of 47 DRs, LARMOR consistently outperforms prior DR selection baselines and approaches Oracle-level performance, all without relying on training data from the target domain. The work demonstrates the practicality of zero-shot, target-corpus–driven DR selection and suggests avenues for improving prompts and reducing LLM-related costs in real deployments.

Abstract

In this paper, we present Large Language Model Assisted Retrieval Model Ranking (LARMOR), an effective unsupervised approach that leverages LLMs for selecting which dense retriever to use on a test corpus (target). Dense retriever selection is crucial for many IR applications that rely on dense retrievers trained on public corpora to encode or search a new, private target corpus. This is because a dense retriever's performance often drops under domain shift, where the corpora, domains, or tasks of the target differ from those the retriever was trained on. Furthermore, when the target corpus is unlabeled, e.g., in a zero-shot scenario, direct evaluation of the model on the target corpus becomes infeasible. Unsupervised selection of the most effective pre-trained dense retriever then becomes a crucial challenge. Current methods for dense retriever selection are insufficient for handling scenarios with domain shift. Our proposed solution leverages LLMs to generate pseudo-relevant queries, labels, and reference lists based on a set of documents sampled from the target corpus. Dense retrievers are then ranked based on their effectiveness on these generated pseudo-relevant signals. Notably, our method is the first approach that relies solely on the target corpus, eliminating the need for both training corpora and test labels. To evaluate the effectiveness of our method, we construct a large pool of state-of-the-art dense retrievers. The proposed approach outperforms existing baselines with respect to both dense retriever selection and ranking. We make our code and results publicly available at https://github.com/ielab/larmor/.
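
The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM generation step is replaced by hand-written (pseudo-query, source document) pairs, each "retriever" is a hypothetical function that maps a query to a ranked list of document ids, and mean reciprocal rank is swapped in as a simple stand-in for whatever effectiveness measure the paper uses. The names `Retriever`, `reciprocal_rank`, and `rank_retrievers` are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a retriever maps a query string to a ranked list of doc ids.
Retriever = Callable[[str], List[str]]


def reciprocal_rank(ranking: List[str], source_doc: str) -> float:
    """Return 1/rank of source_doc in ranking, or 0.0 if it is absent."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == source_doc:
            return 1.0 / position
    return 0.0


def rank_retrievers(
    retrievers: Dict[str, Retriever],
    pseudo_queries: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Score each retriever by mean reciprocal rank over the pseudo-queries.

    Each pseudo-query is paired with the document it was generated from,
    which is treated as its only (pseudo-)relevant document. This is an
    assumption of the sketch, not a claim about the paper's labels.
    """
    if not pseudo_queries:
        raise ValueError("at least one pseudo-query is required")
    scores = {}
    for name, retrieve in retrievers.items():
        total = sum(reciprocal_rank(retrieve(q), d) for q, d in pseudo_queries)
        scores[name] = total / len(pseudo_queries)
    # Best retriever first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for two dense retrievers, backed by fixed lookup tables.
RUNS_A = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1", "d3"]}
RUNS_B = {"q1": ["d3", "d1", "d2"], "q2": ["d1", "d3", "d2"]}

toy_retrievers = {
    "retriever_a": lambda q: RUNS_A[q],
    "retriever_b": lambda q: RUNS_B[q],
}
toy_pseudo_queries = [("q1", "d1"), ("q2", "d2")]

ranking = rank_retrievers(toy_retrievers, toy_pseudo_queries)
print(ranking)
```

In this toy setup `retriever_a` places every source document first, so it is ranked above `retriever_b`. In a real setting the lookup tables would be replaced by actual dense retrievers searching the target corpus.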

Paper Structure

This paper contains 24 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: nDCG@10 Across Collections. This figure illustrates that selecting the best model based on one collection (as indicated by orange dots) does not necessarily ensure its effectiveness on another. In contrast, our unsupervised approach (indicated by red squares) consistently selects competitive models across various collections, and even identifies the most performant model for Trec-News.
  • Figure 2: The LARMOR dense retriever selection pipeline. Labels Q, QF, QFJ and QFR refer to the ablation points described in Section \ref{sec:ablation}.
  • Figure 3: Kendall Tau (left) and $\Delta_e$ (right) performance of our proposed LARMOR and its various components using different sizes of FlanT5 models.
  • Figure 4: Kendall Tau (left) and $\Delta_e$ (right) performance of pseudo-query generation with different numbers of generated queries and different LLM backbones.
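
The Kendall Tau reported in Figures 3 and 4 is a rank correlation, presumably between the ordering of dense retrievers produced by the method and their ordering by true effectiveness. A minimal sketch of the statistic follows, assuming both orderings are given as score dictionaries over the same retriever names and using the simple tau-a variant without tie correction. The function name `kendall_tau` and the example scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Kendall tau-a between two score dictionaries with identical keys.

    Counts pairs ordered the same way in both dictionaries (concordant)
    and pairs ordered oppositely (discordant). Tied pairs count as neither.
    """
    if predicted.keys() != actual.keys():
        raise ValueError("both dictionaries must score the same retrievers")
    names = sorted(predicted)
    if len(names) < 2:
        raise ValueError("at least two retrievers are required")
    concordant = 0
    discordant = 0
    for a, b in combinations(names, 2):
        product = (predicted[a] - predicted[b]) * (actual[a] - actual[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for three hypothetical retrievers.
predicted_scores = {"dr_a": 3.0, "dr_b": 2.0, "dr_c": 1.0}
actual_scores = {"dr_a": 3.0, "dr_b": 1.0, "dr_c": 2.0}
print(kendall_tau(predicted_scores, actual_scores))
```

A value of 1 means the two orderings agree on every pair of retrievers, and -1 means they disagree on every pair. In the example, two of the three pairs agree and one disagrees, giving (2 - 1) / 3.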