LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney

TL;DR

LexBoost addresses the gap between fast lexical retrieval and semantically rich dense retrieval. Offline, it builds a corpus graph that links each document to its dense nearest neighbors; at query time, it enriches each document's lexical score with the lexical scores of those neighbors. The method is a convex combination of a document's lexical score and the mean lexical score of its neighbors: $\mathrm{LexBoost}(q,d,D) = \lambda \cdot \mathrm{score}(q,d) + \frac{1-\lambda}{n} \cdot \sum_{d_{neigh} \in N} \mathrm{score}(q,d_{neigh})$, where $\lambda \in [0,1]$ is the fusion parameter, $N$ is the set of neighbors of $d$ in the corpus graph, and $n$ is the neighbor count. On MS MARCO DL 19/20 and CORD-19/TREC-COVID, LexBoost delivers statistically significant improvements over strong lexical baselines (e.g., BM25, PL2, DPH, QLD) at negligible query-time cost, and it remains robust to the choice of $\lambda$, $n$, and dataset. The findings show that dense-neighborhood information from a precomputed corpus graph can yield strong effectiveness gains without sacrificing latency. Potential extensions include cross-lingual and user-history-enriched retrieval, dynamic graph construction, and adaptive tuning of the fusion parameter.
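The fusion formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lexboost_score` and `lexboost_rank` are hypothetical, the lexical scores are taken as a precomputed dictionary, a neighbor with no lexical score for the query is assumed to contribute 0, and a document with no neighbors is assumed to keep its plain lexical score.

```python
from typing import Dict, List, Tuple


def lexboost_score(
    doc_id: str,
    lexical_scores: Dict[str, float],
    corpus_graph: Dict[str, List[str]],
    lam: float,
) -> float:
    """Return lam * score(q, d) + (1 - lam) / n * sum of neighbor scores."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    own = lexical_scores.get(doc_id, 0.0)
    neighbors = corpus_graph.get(doc_id, [])
    if not neighbors:
        # Assumption: without neighbors, fall back to the lexical score.
        return own
    # Assumption: a neighbor with no lexical score for this query counts as 0.
    neighbor_mean = sum(lexical_scores.get(n, 0.0) for n in neighbors) / len(neighbors)
    return lam * own + (1.0 - lam) * neighbor_mean


def lexboost_rank(
    lexical_scores: Dict[str, float],
    corpus_graph: Dict[str, List[str]],
    lam: float,
) -> List[Tuple[str, float]]:
    """Re-score every lexically scored document and sort by boosted score."""
    boosted = {
        d: lexboost_score(d, lexical_scores, corpus_graph, lam)
        for d in lexical_scores
    }
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy lexical scores (e.g., from BM25) and a toy corpus graph.
    scores = {"d1": 10.0, "d2": 8.0, "d3": 2.0}
    graph = {"d1": ["d2", "d3"], "d2": ["d1"], "d3": ["d4"]}
    for doc, score in lexboost_rank(scores, graph, lam=0.5):
        print(doc, score)
```

With `lam=0.5`, `d2` moves ahead of `d1` because its only neighbor scores highly, while `lam=1.0` ignores the neighbors and reproduces the lexical baseline ordering.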

Abstract

Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. Dense retrieval methods, on the other hand, are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods approach problems differently and use complementary relevance signals, approximation methods were proposed to balance effectiveness and efficiency. For efficiency, approximation methods like HNSW are frequently used to approximate exhaustive dense retrieval. However, approximation techniques still exhibit considerably higher latency than sparse approaches. We propose LexBoost, which first builds a network of dense neighbors (a corpus graph) using a dense retrieval approach at indexing time. Then, during retrieval, we consider both a document's lexical relevance score and its neighbors' scores to rank the documents. In LexBoost, this remarkably simple application of the Cluster Hypothesis yields stronger ranking effectiveness while adding little computational overhead (since the corpus graph is constructed offline). The method is robust across the number of neighbors considered, various fusion parameters for determining the scores, and different dataset construction methods. We also show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.
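The offline step described above, building a corpus graph of dense neighbors, can be illustrated with a brute-force sketch. Everything here is an assumption for illustration: the function name `build_corpus_graph` is hypothetical, the document embeddings are assumed to come from some dense retrieval model and are passed in as plain lists, and cosine similarity is assumed as the similarity measure. Exhaustive pairwise comparison is quadratic in the corpus size, so it is only suitable for toy collections.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_corpus_graph(
    embeddings: Dict[str, Sequence[float]],
    n: int,
) -> Dict[str, List[str]]:
    """Map each document to its n most similar other documents."""
    graph: Dict[str, List[str]] = {}
    for doc_id, vec in embeddings.items():
        sims = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in embeddings.items()
            if other_id != doc_id  # a document is not its own neighbor
        ]
        # Highest similarity first; keep the top n as neighbors.
        sims.sort(key=lambda pair: pair[0], reverse=True)
        graph[doc_id] = [other_id for _, other_id in sims[:n]]
    return graph


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real ones come from a dense encoder.
    toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(build_corpus_graph(toy, n=1))
```

Because the graph depends only on the documents, it is computed once at indexing time; at query time, looking up a document's neighbors is a dictionary access.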

Paper Structure (19 sections, 1 equation, 5 figures, 6 tables, 1 algorithm)

This paper contains 19 sections, 1 equation, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: MAP and Recall(rel=2)@1000 for LexBoost over BM25, PL2, DPH, and QLD on TREC DL 2019. The faint horizontal lines are the respective baselines (i.e., $\lambda=1$).
  • Figure 2: System Architecture
  • Figure 3: Heat maps showing the impact of varying the fusion parameter $\lambda$ and the number of neighbors $n$ on LexBoost.
  • Figure 4: Validation-based optimization to determine the fusion parameter $\lambda$.
  • Figure 5: Comparison of LexBoost re-ranking with exhaustive dense retrieval using TCT-ColBERT-HNP and TAS-B on the TREC DL 2019 query set.