A Distributed Collaborative Retrieval Framework Excelling in All Queries and Corpora based on Zero-shot Rank-Oriented Automatic Evaluation

Tian-Yi Che, Xian-Ling Mao, Chun Xu, Cheng-Xin Xin, Heng-Da Xu, Jin-Yu Liu, Heyan Huang

TL;DR

The paper addresses the fragmentation of retrieval performance across queries and corpora by proposing a Distributed Collaborative Retrieval Framework (DCRF) that unifies sparse, dense, and LLM-based retrievers. It introduces rank-oriented, zero-shot evaluation via four prompting strategies to select the best ranked results without labeled data, enabling flexible, scalable integration of models. Through extensive experiments on BEIR and TREC datasets with multiple open-source and black-box LLMs, DCRF achieves competitive or superior performance and improved efficiency compared to existing methods like RankGPT and ListT5. The framework offers practical impact by reducing maintenance costs, enabling domain adaptation, and providing a robust baseline for future multi-model retrieval systems and rank-aware evaluations.

Abstract

Numerous retrieval models, including sparse, dense, and LLM-based methods, have demonstrated remarkable performance in predicting the relevance between queries and corpora. However, our preliminary effectiveness analysis indicates that these models fail to achieve satisfactory performance on the majority of queries and corpora, revealing that their effectiveness is restricted to specific scenarios. To tackle this problem, we propose a novel Distributed Collaborative Retrieval Framework (DCRF) that outperforms each single model across all queries and corpora. Specifically, the framework integrates various retrieval models into a unified system and dynamically selects the optimal results for each user's query. It can easily aggregate any retrieval model and extend to any application scenario, illustrating its flexibility and scalability. Moreover, to reduce maintenance and training costs, we design four effective prompting strategies with large language models (LLMs) to evaluate the quality of rankings without relying on labeled data. Extensive experiments demonstrate that the proposed framework, combined with 8 efficient retrieval models, can achieve performance comparable to effective listwise methods like RankGPT and ListT5, while offering superior efficiency. In addition, DCRF surpasses all selected retrieval models on most datasets, indicating the effectiveness of our prompting strategies for rank-oriented automatic evaluation.
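
The selection step described in the abstract (several retrieval models rank the same passages, and an evaluator keeps the best ranking for each query) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the two toy rerankers, and the lexical-overlap evaluator are all hypothetical stand-ins. In the paper the evaluator is an LLM prompted with one of the four strategies; a simple word-overlap score is swapped in here so that the sketch runs without any model.

```python
from typing import Callable, Dict, List, Tuple

Ranking = List[str]
Reranker = Callable[[str, List[str]], Ranking]
Evaluator = Callable[[str, Ranking], float]


def select_best_ranking(
    query: str,
    passages: List[str],
    rerankers: Dict[str, Reranker],
    evaluator: Evaluator,
) -> Tuple[str, Ranking]:
    """Let every reranker order the same passages, then keep the
    candidate ranking that the evaluator scores highest."""
    candidates = {name: rerank(query, passages) for name, rerank in rerankers.items()}
    best_name = max(candidates, key=lambda name: evaluator(query, candidates[name]))
    return best_name, candidates[best_name]


def overlap(query: str, passage: str) -> int:
    """Number of distinct lowercase words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def by_overlap(query: str, passages: List[str]) -> Ranking:
    """Hypothetical reranker: most shared words first."""
    return sorted(passages, key=lambda p: overlap(query, p), reverse=True)


def by_length(query: str, passages: List[str]) -> Ranking:
    """Hypothetical (deliberately weak) reranker: shortest passage first."""
    return sorted(passages, key=len)


def toy_evaluator(query: str, ranking: Ranking, k: int = 2) -> float:
    """Label-free stand-in for the LLM evaluator (an assumption):
    position-discounted word overlap over the top-k passages."""
    return sum(overlap(query, p) / (i + 1) for i, p in enumerate(ranking[:k]))


if __name__ == "__main__":
    passages = [
        "cats sleep a lot",
        "dense retrieval uses embeddings",
        "sparse retrieval uses term matching",
    ]
    name, ranking = select_best_ranking(
        "dense retrieval embeddings",
        passages,
        {"by_overlap": by_overlap, "by_length": by_length},
        toy_evaluator,
    )
    print(name, ranking)
```

Because the evaluator only needs the query and a candidate ranking, adding another retrieval model amounts to adding one more entry to the `rerankers` dictionary, which is the kind of flexibility the abstract claims for the framework.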

Paper Structure

This paper contains 34 sections, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: The frequency of retrieval models on the TREC and BEIR benchmarks. Frequency is the proportion of queries on which a specific model generates the best results among all selected models. The % sign is omitted from the values, and the maximum displayed value is 25% for legibility. Each region denotes the capacity of a retrieval model across all the queries and corpora; the ideal shape is the ten-sided polygon.
  • Figure 2: Comparison of the traditional IR framework and DCRF. (a) A retriever recalls the top-100 relevant passages and a reranker reorders them by relevance to the query. (b) The retriever is the same, but multiple rerankers each reorder the recalled passages to produce candidate rankings. Finally, the evaluator picks the best ranking from the candidates based on the query.
  • Figure 3: Four types of prompting methods for zero-shot rank-oriented automatic evaluation with LLMs. The gray and yellow blocks indicate the inputs and outputs of the model. (a) instructs LLMs to score the relevance between the query and a passage. (b) instructs LLMs to output a relevance assessment. (c) assigns an overall score to each ranking. (d) directly selects the assistant with the better ranking. Complete prompts are given in Appendix A.
  • Figure 4: Comparison for ideal scenarios on TREC and BEIR. "Best base model" in the legend indicates the best performance among all selected retrieval models. "Best DCRF" indicates the performance of the framework when NDCG@10 is used as the evaluator, which relies on human-annotated labels.
  • Figure 5: Statistics of how often each selected retrieval model is chosen as the best model on Dbpedia-Entity. The retrieval models appear in the same order in the legend and in the evaluation. Passage-pointwise is used as the prompting method.
  • ...and 1 more figures
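
The two statistics behind Figures 1 and 4 can be illustrated with a short sketch: the per-model frequency (the share of queries on which a model gives the best result) and the ideal-scenario score obtained by picking the best model for every query using a label-based metric. The function names and the per-query scores below are hypothetical and made up for illustration; they are not results from the paper. Ties are broken in favour of the model listed first, which is an assumption, since the captions do not say how ties are handled.

```python
from collections import Counter
from typing import Dict, List


def best_model_frequency(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Proportion of queries on which each model has the highest score.
    `scores[model][q]` is a per-query metric value such as NDCG@10.
    Ties go to the model listed first (an assumption)."""
    models = list(scores)
    num_queries = len(scores[models[0]])
    wins: Counter = Counter()
    for q in range(num_queries):
        winner = max(models, key=lambda m: scores[m][q])
        wins[winner] += 1
    return {m: wins[m] / num_queries for m in models}


def oracle_mean(scores: Dict[str, List[float]]) -> float:
    """Mean score when the best model is chosen per query using labels,
    i.e. an upper bound for any label-free evaluator."""
    models = list(scores)
    num_queries = len(scores[models[0]])
    return sum(max(scores[m][q] for m in models) for q in range(num_queries)) / num_queries


def model_mean(scores: Dict[str, List[float]], model: str) -> float:
    """Mean score of a single model over all queries."""
    return sum(scores[model]) / len(scores[model])


if __name__ == "__main__":
    # Made-up per-query scores for two hypothetical models.
    toy_scores = {
        "sparse_model": [0.5, 0.2, 0.9, 0.1],
        "dense_model": [0.4, 0.6, 0.3, 0.8],
    }
    print(best_model_frequency(toy_scores))
    print(oracle_mean(toy_scores), model_mean(toy_scores, "dense_model"))
```

In this toy input each model wins on half of the queries, so neither dominates, and the per-query oracle scores higher than either model on its own. That gap is the headroom a per-query selection framework tries to recover without labels.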