CASPER: Concept-integrated Sparse Representation for Scientific Retrieval

Lam Thanh Do, Linh Van Nguyen, Jiayu Li, David Fu, Kevin Chen-Chuan Chang

TL;DR

CASPER introduces a concept-integrated sparse retrieval framework that represents scientific texts with both tokens and keyphrases, enabling matching at granular and concept levels. By constructing a large, comprehensive keyphrase vocabulary and training with decoupled token- and keyphrase-representation losses, CASPER leverages FRIEREN-derived supervision from scholarly references to achieve strong cross-domain retrieval performance across eight benchmarks. The paper also demonstrates practical benefits in efficiency via pruning and capabilities in keyphrase generation, with CASPER++ further enhancing effectiveness when combined with BM25. Overall, CASPER advances end-to-end, concept-aware scientific IR and provides insights into data-source importance and pooling strategies for sparse representations.

Abstract

Identifying relevant research concepts is crucial for effective scientific search. However, primary sparse retrieval methods often lack concept-aware representations. To address this, we propose CASPER, a sparse retrieval model for scientific search that utilizes both tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space). This enables CASPER to represent queries and documents via research concepts and match them at both granular and conceptual levels. Furthermore, we construct training data by leveraging abundant scholarly references (including titles, citation contexts, author-assigned keyphrases, and co-citations), which capture how research concepts are expressed in diverse settings. Empirically, CASPER outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks. We also explore the effectiveness-efficiency trade-off via representation pruning and demonstrate CASPER's interpretability by showing that it can serve as an effective and efficient keyphrase generation model.
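
The idea of using both tokens and keyphrases as dimensions of one sparse vector, and of pruning that vector, can be illustrated with a toy sketch. This is not the authors' implementation: in CASPER the weights are learned by the model, whereas here they are hand-picked, and the `tok:` / `kp:` prefixes, the function names, and the top-k pruning rule are all assumptions made for illustration only.

```python
from typing import Dict, Tuple

# A sparse representation maps a dimension name to a weight.
# Assumption: "tok:" marks a token dimension, "kp:" a keyphrase dimension.
SparseVec = Dict[str, float]


def score(query: SparseVec, doc: SparseVec) -> float:
    """Dot product over the dimensions the query and document share."""
    return sum(w * doc[dim] for dim, w in query.items() if dim in doc)


def score_by_level(query: SparseVec, doc: SparseVec) -> Tuple[float, float]:
    """Split the score into (token-level, keyphrase-level) contributions."""
    tok = sum(w * doc[d] for d, w in query.items()
              if d in doc and d.startswith("tok:"))
    kp = sum(w * doc[d] for d, w in query.items()
             if d in doc and d.startswith("kp:"))
    return tok, kp


def prune(vec: SparseVec, k: int) -> SparseVec:
    """Keep only the k highest-weighted dimensions (one simple pruning rule)."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


# Hand-picked, hypothetical weights.
query = {"tok:sparse": 1.0, "tok:retrieval": 0.5, "kp:sparse retrieval": 2.0}
doc = {
    "tok:retrieval": 1.0,
    "tok:scientific": 0.5,
    "kp:sparse retrieval": 1.5,
    "kp:keyphrase generation": 0.25,
}

print(score(query, doc))           # 3.5
print(score_by_level(query, doc))  # (0.5, 3.0)
print(prune(doc, 2))               # keeps the two heaviest dimensions
```

In this toy example most of the score comes from the shared keyphrase dimension, and pruning the document to its two heaviest dimensions leaves the score unchanged while halving the number of stored entries. Real pruning would generally trade some effectiveness for the space and latency savings.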

Paper Structure

This paper contains 30 sections, 9 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Example of scholarly references.
  • Figure 2: Overview of CASPER, our proposed method.
  • Figure 3: Retrieval performance drop as different data sources are removed.
  • Figure 4: Trade-off between effectiveness (nDCG@10) and efficiency (Disk Space and Latency) for CASPER, CASPER++ (at varying pruning levels), and baselines. nDCG@10 is averaged across eight benchmark datasets. Efficiency metrics (Disk Space in GB; Latency in ms) are measured on CSFCube, the largest dataset, using a single-threaded CPU. Latency values exclude encoding time and are averaged over three runs.
  • Figure 5: Retrieval performance (averaged across eight datasets) with different values of $\beta$.