ECLIPSE: Contrastive Dimension Importance Estimation with Pseudo-Irrelevance Feedback for Dense Retrieval

Giulio D'Erasmo, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri

TL;DR

This work tackles the problem of noisy and non-discriminative dimensions in high-dimensional dense embeddings for information retrieval. It introduces Eclipse, a contrastive dimension importance estimator that uses both top-ranked (relevant) and bottom-ranked (irrelevant) retrieved documents to form sun and moon representations, $\mathbf{s}$ and $\mathbf{m}$, yielding a residual dimension importance vector $u^{\text{Eclipse}}_q = \alpha (\mathbf{q} \odot \mathbf{s}) - \beta (\mathbf{q} \odot \mathbf{m})$ (equivalently $u^{\text{Eclipse}}_q = \mathbf{q} \odot (\alpha \mathbf{s} - \beta \mathbf{m})$). The method can be plugged into existing DIMEs (PRF or LLM-based) and consistently improves retrieval performance across four benchmarks and three base models, with average AP gains of up to $19.50\%$ over the DIME-based baseline (or $22.35\%$ over the baseline using all dimensions) and $\text{nDCG@10}$ gains of up to $11.42\%$ (or $13.10\%$, respectively). Key findings show that highly irrelevant documents are valuable contrast signals, and that the semantic content of the irrelevant texts is less crucial than their low relevance, supporting a broader, pseudo-irrelevance-based approach for robust dense retrieval.
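
The importance vector above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sun and moon representations are plain mean embeddings (centroids) of the top-ranked and bottom-ranked documents, and that low-importance query dimensions are zeroed out, as in DIME-style filtering. The helper names (`centroid`, `eclipse_importance`, `retain_top_dimensions`) and the toy vectors are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(docs: Sequence[Sequence[float]]) -> Vector:
    """Dimension-wise mean of a set of document embeddings."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]


def eclipse_importance(
    q: Sequence[float],
    top_docs: Sequence[Sequence[float]],
    bottom_docs: Sequence[Sequence[float]],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> Vector:
    """Compute u = q * (alpha * s - beta * m), element-wise.

    Assumption: s ("sun") and m ("moon") are centroids of the
    top-ranked and bottom-ranked documents, respectively.
    """
    s = centroid(top_docs)
    m = centroid(bottom_docs)
    return [qi * (alpha * si - beta * mi) for qi, si, mi in zip(q, s, m)]


def retain_top_dimensions(
    q: Sequence[float], importance: Sequence[float], fraction: float
) -> Vector:
    """Keep the given fraction of most important dimensions, zero the rest.

    Assumption: filtered-out dimensions are set to zero in the query.
    """
    keep = max(1, int(len(q) * fraction))
    ranked = sorted(range(len(q)), key=lambda i: importance[i], reverse=True)
    kept = set(ranked[:keep])
    return [qi if i in kept else 0.0 for i, qi in enumerate(q)]


# Toy 4-dimensional example (made-up numbers, for illustration only).
q = [1.0, 2.0, -1.0, 0.5]
top_docs = [[1.0, 0.0, -1.0, 0.0], [1.0, 2.0, -1.0, 0.0]]
bottom_docs = [[0.0, 2.0, 0.0, 1.0], [0.0, 2.0, 0.0, 1.0]]

u = eclipse_importance(q, top_docs, bottom_docs)
filtered_q = retain_top_dimensions(q, u, fraction=0.5)
print(u)           # [1.0, -2.0, 1.0, -0.5]
print(filtered_q)  # [1.0, 0.0, -1.0, 0.0]
```

In this toy case, dimension 2 of the query is large but is shared with the bottom-ranked documents, so the contrast with the moon centroid pushes its importance below zero and it is filtered out.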

Abstract

Recent advances in Information Retrieval have leveraged high-dimensional embedding spaces to improve the retrieval of relevant documents. Moreover, the Manifold Clustering Hypothesis suggests that despite these high-dimensional representations, documents relevant to a query reside on a lower-dimensional, query-dependent manifold. While this hypothesis has inspired new retrieval methods, existing approaches still face challenges in effectively separating non-relevant information from relevant signals. We propose a novel methodology that addresses these limitations by leveraging information from both relevant and non-relevant documents. Our method, ECLIPSE, computes a centroid based on irrelevant documents as a reference to estimate noisy dimensions present in relevant ones, enhancing retrieval performance. Extensive experiments on three in-domain benchmarks and one out-of-domain benchmark demonstrate an average improvement of up to 19.50% (resp. 22.35%) in mAP (AP) and 11.42% (resp. 13.10%) in nDCG@10 w.r.t. the DIME-based baseline (resp. the baseline using all dimensions). Our results pave the way for more robust, pseudo-irrelevance-based retrieval systems in future IR research.

Paper Structure

This paper contains 6 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of retrieval results (the top relevant document and the first false-positive (FP) documents) for the query "What is an active margin?" using the PRF Eclipse and PRF standard DIME methods with the ANCE model. Green indicates documents that match the topic of the query; red indicates documents that do not. PRF Eclipse achieves a significantly higher AP, an improvement attributed to Eclipse's ability to push documents that are topically relevant to the query higher in the ranking.
  • Figure 2: Performance comparison of PRF-Eclipse on the DL '19 (a) and RB '04 (b) collections, showing AP as the percentage of retained dimensions increases. Different $k$ values represent the cardinality of the retrieved document set $\mathcal{D}_q$. Smaller values of $k$ (e.g. $k=50$) include only highly relevant documents, while larger values (e.g. $k=50{,}000$) gradually incorporate less relevant documents, affecting retrieval performance.