Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval

Priyanka Mandikal, Raymond Mooney

TL;DR

This work investigates whether dense, transformer-based embeddings outperform traditional sparse IR in scientific document retrieval and finds that SPECTER2 is not consistently superior on the cystic fibrosis benchmark. It introduces a simple hybrid retriever that combines dense and sparse representations, formalized as $S_{hybrid}(D,Q) = \lambda \ \text{Sim}(z_{dense}(D), z_{dense}(Q)) + (1 - \lambda) \ \text{Sim}(z_{sparse}(D), z_{sparse}(Q))$, and demonstrates that this hybrid significantly outperforms both baselines. Across precision-recall and NDCG metrics, the hybrid approach yields notable gains, with λ = 0.8 providing the best trade-off in the experiments. The results suggest that integrating classical TF-IDF with modern dense embeddings can robustly improve retrieval in specialized scientific domains, guiding future IR designs for domain-specific corpora.
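The hybrid score above can be illustrated with a minimal sketch. It assumes that $\text{Sim}$ is cosine similarity, that dense vectors are plain lists of floats, and that sparse TF-IDF vectors are term-to-weight dicts. The function names (`cosine_dense`, `cosine_sparse`, `hybrid_score`, `rank_documents`) are hypothetical and not taken from the paper.

```python
import math


def cosine_dense(u, v):
    """Cosine similarity between two equal-length dense vectors (lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(dense_doc, dense_query, sparse_doc, sparse_query, lam=0.8):
    """S_hybrid = lam * Sim(dense) + (1 - lam) * Sim(sparse).

    Assumption: Sim is cosine similarity for both representations.
    """
    dense_sim = cosine_dense(dense_doc, dense_query)
    sparse_sim = cosine_sparse(sparse_doc, sparse_query)
    return lam * dense_sim + (1.0 - lam) * sparse_sim


def rank_documents(docs, query, lam=0.8):
    """Rank document ids by descending hybrid score.

    docs:  dict mapping doc id -> (dense_vector, sparse_vector)
    query: (dense_vector, sparse_vector)
    """
    q_dense, q_sparse = query
    scores = {
        doc_id: hybrid_score(d_dense, q_dense, d_sparse, q_sparse, lam)
        for doc_id, (d_dense, d_sparse) in docs.items()
    }
    return sorted(scores, key=lambda doc_id: -scores[doc_id])


# Toy data (made up for illustration): "a" matches the query densely,
# "b" matches it only through a shared term.
docs = {
    "a": ([1.0, 0.0], {"cystic": 1.0}),
    "b": ([0.0, 1.0], {"fibrosis": 1.0}),
}
query = ([1.0, 0.0], {"fibrosis": 1.0})
ranking_dense_heavy = rank_documents(docs, query, lam=0.8)
ranking_sparse_only = rank_documents(docs, query, lam=0.0)
```

With λ = 0.8 the dense match dominates the toy ranking, while λ = 0 reduces the score to purely sparse retrieval, so the single parameter interpolates between the two baselines.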

Abstract

Traditional information retrieval is based on sparse bag-of-words vector representations of documents and queries. More recent deep-learning approaches have used dense embeddings learned using a transformer-based large language model. We show that, on a classic benchmark for scientific document retrieval in the medical domain of cystic fibrosis, both of these models perform roughly equivalently. Notably, dense vectors from the state-of-the-art SPECTER2 model do not significantly enhance performance. However, a hybrid model that we propose, which combines these methods, yields significantly better results, underscoring the merits of integrating classical and contemporary deep learning techniques in information retrieval for specialized scientific documents.

Paper Structure (15 sections, 1 equation, 5 figures)


Figures (5)

  • Figure 1: Overview of our approach. On a medical dataset of cystic fibrosis documents, we combine sparse bag-of-words embeddings with dense embeddings from a SOTA LLM (SPECTER2; Singh et al., 2023) to produce a hybrid retriever that significantly outperforms both methods.
  • Figure 2: Results on the Cystic-Fibrosis dataset. The hybrid approach ($\lambda=0.8$) outperforms both traditional sparse vector-space retrieval (VSR) and state-of-the-art deep embeddings (SPECTER2; Singh et al., 2023) on both the precision-recall (PR, left) and NDCG (right) metrics.
  • Figure 3: Sample queries and retrievals. We show results for the three methods on four sample queries. Our hybrid approach outperforms both VSR and SPECTER2, retrieving more relevant documents at higher ranks. We also show sample retrievals relevant to the given query with keywords highlighted.
  • Figure 4: Ablations with different values of $\lambda$. Weighting the deep embedding with $\lambda=0.8$ (so that the sparse embedding gets a weight of $0.2$) produces the best results for both PR and NDCG metrics.
  • Figure 5: Ablations with SPECTER2 base vs. adapter versions. The base model performs as well as the adapter variant on the PR metric. The adapters marginally improve performance on NDCG but hurt precision at high recall levels.
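
The NDCG metric reported in the figures can be sketched as follows. This is a minimal illustration that assumes linear gains and a $\log_2$ rank discount; the exact NDCG variant used in the paper is not stated here, and the names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at 0-based rank i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k for one query.

    ranked_ids: document ids in the order returned by the retriever
    relevance:  dict mapping doc id -> graded relevance (missing ids count as 0)
    Assumption: linear gain and log2 discount.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy judgments (made up for illustration).
relevance = {"a": 2, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], relevance, k=3)
reversed_order = ndcg_at_k(["c", "b", "a"], relevance, k=3)
```

A ranking that places the most relevant documents first reaches the maximum value of 1, and pushing them down the list lowers the score, which is why NDCG complements the precision-recall curves.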