Uncovering the Limitations of Query Performance Prediction: Failures, Insights, and Implications for Selective Query Processing
Adrian-Gabriel Chifu, Sébastien Déjean, Josiane Mothe, Moncef Garouani, Diego Ortiz, Md Zia Ullah
TL;DR
This paper tackles the generalization problem of Query Performance Prediction (QPP) across sparse and dense IR paradigms and diverse collections. It performs a comprehensive cross-paradigm evaluation of state-of-the-art QPPs (e.g., NQC, UQC), LETOR-derived features, and dense-based predictors (BERT-based) across sparse rankers such as BM25 and DFree, with and without query expansion (QE), and the hybrid or dense rankers SPLADE and ColBERT, on ROBUST, GOV2, WT10G, and MS MARCO. Using two-fold query-level cross-validation and standard effectiveness metrics (NDCG, MAP, P@10, MRR@10), plus correlation and regression error measures, the study reveals large variability in predictor accuracy and a strong influence of the collection, with limited cross-collection generalization and modest downstream gains for selective processing. The results highlight the fragility of current QPP approaches, especially in dense retrieval contexts, and call for new predictors that generalize across collections and align with dense architectures, so that they become practically useful for downstream tasks such as selective ranking and QE decisions.
Abstract
Query Performance Prediction (QPP) estimates a retrieval system's effectiveness for a given query, offering valuable insights for search effectiveness and query processing. Despite extensive research, QPPs face critical challenges in generalizing across diverse retrieval paradigms and collections. This paper provides a comprehensive evaluation of state-of-the-art QPPs (e.g., NQC, UQC), LETOR-based features, and newly explored dense-based predictors. Using diverse sparse rankers (BM25 and DFree, with and without query expansion), hybrid or dense rankers (SPLADE and ColBERT), and diverse test collections (ROBUST, GOV2, WT10G, and MS MARCO), we investigate the relationships between predicted and actual performance, with a focus on generalization and robustness. Results show significant variability in predictor accuracy, with the collection as the main factor and the ranker next. Some sparse predictors perform reasonably well on some collections (TREC ROBUST and GOV2) but do not generalize to others (WT10G and MS MARCO). While some predictors show promise in specific scenarios, their overall limitations constrain their utility for practical applications. We show that QPP-driven selective query processing offers only marginal gains, emphasizing the need for improved predictors that generalize across collections, align with dense retrieval architectures, and are useful for downstream applications.
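
To make the two evaluation ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each query has a predicted score and an actual effectiveness value (for instance NDCG), measures their agreement with Pearson and Kendall correlation, and then simulates selective query processing by choosing, per query, the configuration with the higher predicted score. All function names and all toy values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between predicted and actual per-query scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def selective_mean(pred_a, pred_b, actual_a, actual_b):
    """Mean effectiveness when each query uses the configuration
    (A or B) with the higher predicted score; ties go to A."""
    chosen = [
        a if pa >= pb else b
        for pa, pb, a, b in zip(pred_a, pred_b, actual_a, actual_b)
    ]
    return sum(chosen) / len(chosen)


# Hypothetical toy data: four queries, two configurations
# (A = no query expansion, B = with query expansion).
pred_a = [0.6, 0.2, 0.7, 0.3]
pred_b = [0.4, 0.5, 0.6, 0.8]
actual_a = [0.55, 0.20, 0.60, 0.35]
actual_b = [0.50, 0.45, 0.65, 0.70]

print("Pearson (A):", pearson(pred_a, actual_a))
print("Kendall tau-a (A):", kendall_tau_a(pred_a, actual_a))
print("Always A:", sum(actual_a) / len(actual_a))
print("Always B:", sum(actual_b) / len(actual_b))
print("Selective:", selective_mean(pred_a, pred_b, actual_a, actual_b))
```

Comparing the selective mean against the better of the two fixed configurations gives the downstream gain attributable to the predictor. Under this framing, a predictor can correlate with actual performance and still yield little or no gain if it rarely ranks the two configurations correctly for the same query.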
