LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

Hrishikesh Kulkarni; Nazli Goharian; Ophir Frieder; Sean MacAvaney

TL;DR

LexBoost addresses the gap between fast lexical retrieval and semantically rich dense retrieval in two stages: offline, it builds a corpus graph of dense nearest neighbors; online, it enriches each document's lexical score with its neighbors' lexical scores. The method is a convex combination of a document's lexical score and the mean lexical score of its neighbors, formalized as $\text{LexBoost}(q,d,D) = \lambda \cdot \text{score}(q,d) + \frac{1-\lambda}{n} \cdot \sum_{d_{neigh} \in N} \text{score}(q,d_{neigh})$, where $\lambda \in [0,1]$ is the fusion parameter, $N$ is the set of neighbors of $d$ in the corpus graph, and $n$ is the neighbor count. Across MS MARCO DL 19/20 and CORD-19/TREC-COVID, LexBoost delivers statistically significant improvements over strong lexical baselines (e.g., BM25, PL2, DPH, QLD) with negligible query-time cost, and it remains robust to hyperparameters and dataset choice. The findings show that dense-neighborhood information from a precomputed corpus graph yields effectiveness gains without sacrificing latency. Potential extensions include cross-lingual and user-history enriched retrieval scenarios, as well as dynamic graph construction and adaptive tuning of the fusion parameter.
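The fusion formula above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `lexboost_scores`, the dictionary-based inputs, and the handling of edge cases (a neighbor with no lexical score for the query counts as 0, and a document with no neighbors keeps its own score) are assumptions made for illustration.

```python
from typing import Dict, List


def lexboost_scores(
    lexical: Dict[str, float],
    neighbors: Dict[str, List[str]],
    lam: float,
) -> Dict[str, float]:
    """Fuse each document's lexical score with the mean score of its neighbors.

    lexical:   doc id -> lexical score for one query (e.g., from BM25)
    neighbors: doc id -> ids of its n dense neighbors (the corpus graph)
    lam:       fusion parameter lambda in [0, 1]; lam = 1 is the lexical baseline
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    fused: Dict[str, float] = {}
    for doc_id, score in lexical.items():
        neigh = neighbors.get(doc_id, [])
        if neigh:
            # Assumption: a neighbor without a lexical score contributes 0.
            neigh_mean = sum(lexical.get(n, 0.0) for n in neigh) / len(neigh)
        else:
            # Assumption: with no neighbors, fall back to the document's own score.
            neigh_mean = score
        fused[doc_id] = lam * score + (1.0 - lam) * neigh_mean
    return fused


# Toy example with hypothetical scores and a hypothetical 2-neighbor graph.
toy_lexical = {"d1": 10.0, "d2": 4.0, "d3": 0.0}
toy_graph = {"d1": ["d2", "d3"], "d2": ["d1", "d3"], "d3": ["d1", "d2"]}
toy_fused = lexboost_scores(toy_lexical, toy_graph, lam=0.5)
```

In the toy example, `d3` has no lexical match with the query but still receives a positive fused score because its neighbors match, which is the effect the Cluster Hypothesis predicts.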

Abstract

Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, dense retrieval methods are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods approach problems differently and use complementary relevance signals, approximation methods were proposed to balance effectiveness and efficiency. For efficiency, approximation methods like HNSW are frequently used to approximate exhaustive dense retrieval. However, approximation techniques still exhibit considerably higher latency than sparse approaches. We propose LexBoost that first builds a network of dense neighbors (a corpus graph) using a dense retrieval approach while indexing. Then, during retrieval, we consider both a document's lexical relevance scores and its neighbors' scores to rank the documents. In LexBoost this remarkably simple application of the Cluster Hypothesis contributes to stronger ranking effectiveness while contributing little computational overhead (since the corpus graph is constructed offline). The method is robust across the number of neighbors considered, various fusion parameters for determining the scores, and different dataset construction methods. We also show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.
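The offline step described in the abstract, building a network of dense neighbors at indexing time, might look roughly like the sketch below. It assumes each document already has a dense embedding from some dense retrieval model and uses exhaustive cosine similarity, which is quadratic in the corpus size and only practical for toy data. The function names and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_corpus_graph(
    embeddings: Dict[str, Sequence[float]], n: int
) -> Dict[str, List[str]]:
    """Map each document id to the ids of its n most similar documents."""
    graph: Dict[str, List[str]] = {}
    for doc_id, vec in embeddings.items():
        sims = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in embeddings.items()
            if other_id != doc_id  # a document is not its own neighbor
        ]
        # Highest similarity first; ties broken by document id for determinism.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[doc_id] = [other_id for _, other_id in sims[:n]]
    return graph


# Hypothetical 2-dimensional embeddings for four documents.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
toy_corpus_graph = build_corpus_graph(toy_embeddings, n=1)
```

Because the graph depends only on the documents, it can be computed once and stored alongside the index, so query-time work reduces to looking up each candidate's stored neighbor ids.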


Paper Structure (19 sections, 1 equation, 5 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Lexical Methods
Dense Methods
Approximation Methods
Comparison with LADR
Hybrid Methods
LexBoost
Experiment
Datasets and Measures
Models and Parameters
Baselines and Implementation
Results and Analysis
RQ1 and RQ2: Insights from corpus graph and impact on retrieval
RQ3: Robustness across number of neighbors considered
...and 4 more sections

Figures (5)

Figure 1: MAP and Recall(rel=2)@1000 for LexBoost on BM25, PL2, DPH, QLD - TREC DL 2019. The faint horizontal lines are respective baselines (i.e., $\lambda=1$).
Figure 2: System Architecture
Figure 3: Heat maps showing the impact of varying the fusion parameter $\lambda$ and the number of neighbors $n$ on LexBoost.
Figure 4: Validation-based optimization of the fusion parameter $\lambda$.
Figure 5: Comparison of LexBoost re-ranking with exhaustive dense retrieval using TCT-ColBERT-HNP and TAS-B on the TREC DL 2019 query set.
