CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
Lam Thanh Do, Linh Van Nguyen, Jiayu Li, David Fu, Kevin Chen-Chuan Chang
TL;DR
CASPER is a concept-integrated sparse retrieval framework that represents scientific texts with both tokens and keyphrases, so queries and documents can be matched at the granular (token) level and at the concept (keyphrase) level. It constructs a large, comprehensive keyphrase vocabulary, trains with decoupled token- and keyphrase-representation losses, and leverages FRIEREN-derived supervision from scholarly references, achieving strong cross-domain retrieval performance across eight benchmarks. The paper also demonstrates efficiency gains via representation pruning and shows that CASPER can perform keyphrase generation, with CASPER++ further enhancing effectiveness when combined with BM25. Finally, it offers insights into the importance of each data source and the choice of pooling strategy for sparse representations.
Abstract
Identifying relevant research concepts is crucial for effective scientific search. However, primary sparse retrieval methods often lack concept-aware representations. To address this, we propose CASPER, a sparse retrieval model for scientific search that utilizes both tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space). This enables CASPER to represent queries and documents via research concepts and match them at both granular and conceptual levels. Furthermore, we construct training data by leveraging abundant scholarly references (including titles, citation contexts, author-assigned keyphrases, and co-citations), which capture how research concepts are expressed in diverse settings. Empirically, CASPER outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks. We also explore the effectiveness-efficiency trade-off via representation pruning and demonstrate CASPER's interpretability by showing that it can serve as an effective and efficient keyphrase generation model.
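To make the idea of mixed token and keyphrase dimensions concrete, the following is a minimal sketch, not the authors' implementation. It assumes a sparse representation is a dictionary from dimension name to non-negative weight, that relevance is a dot product over shared dimensions, and that pruning keeps the k highest-weighted dimensions. The `tok:` and `kp:` prefixes, the function names, and all weights are hypothetical and chosen only for illustration.

```python
from typing import Dict

# A sparse representation: dimension name -> weight.
# Assumption: token dimensions are prefixed "tok:", keyphrase dimensions "kp:".
SparseVec = Dict[str, float]


def score(query: SparseVec, doc: SparseVec) -> float:
    """Dot product over the dimensions the two representations share."""
    # Iterate over the smaller dictionary to limit lookups.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(w * doc[dim] for dim, w in query.items() if dim in doc)


def prune(vec: SparseVec, k: int) -> SparseVec:
    """Keep only the k highest-weighted dimensions (a simple top-k pruning)."""
    top = sorted(vec.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


def top_keyphrases(vec: SparseVec, k: int) -> list:
    """Read off the k highest-weighted keyphrase dimensions as keyphrases."""
    keyphrase_dims = {d: w for d, w in vec.items() if d.startswith("kp:")}
    return [d[len("kp:"):] for d in prune(keyphrase_dims, k)]


# Toy query and document with made-up weights.
query = {"tok:sparse": 1.0, "tok:retrieval": 1.0, "kp:sparse retrieval": 2.0}
doc = {
    "tok:sparse": 0.5,
    "tok:model": 0.5,
    "kp:sparse retrieval": 1.5,
    "kp:keyphrase generation": 1.0,
}

full_score = score(query, doc)              # token match + keyphrase match
pruned_score = score(query, prune(doc, 2))  # only the two strongest doc dimensions
```

In this toy example the shared token `tok:sparse` and the shared keyphrase `kp:sparse retrieval` both contribute to the full score, while pruning the document to two dimensions drops the token match and keeps the keyphrase match. This illustrates the kind of effectiveness-efficiency trade-off the abstract refers to. Reading off the top keyphrase dimensions, as in `top_keyphrases`, is one simple way a sparse representation of this kind could double as a keyphrase generator.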
