PairSem: LLM-Guided Pairwise Semantic Matching for Scientific Document Retrieval

Wonbin Kweon, Runchu Tian, SeongKu Kang, Pengcheng Jiang, Zhiyong Lu, Jiawei Han, Hwanjo Yu

TL;DR

PairSem is an unsupervised framework for fine-grained scientific document retrieval that models semantics as entity–aspect pairs. It combines offline LLM-based pair generation with corpus-level synonym merging, candidate augmentation, and lightweight predictors, which enable efficient inference (PairSem_fast) without query–document labels. In experiments on Chemistry, Biomedical, and Computer Science datasets with multiple base retrievers, PairSem yields consistent retrieval gains and favorable time–accuracy trade-offs, including substantial recall on LitSearch and improvements over strong baselines such as SemRank. The results indicate that explicitly capturing the multi-aspect semantics of scientific concepts improves document matching, and that the method can be plugged into existing dense retrievers.

Abstract

Scientific document retrieval is a critical task for enabling knowledge discovery and supporting research across diverse domains. However, existing dense retrieval methods often struggle to capture fine-grained scientific concepts in texts due to their reliance on holistic embeddings and limited domain understanding. Recent approaches leverage large language models (LLMs) to extract fine-grained semantic entities and enhance semantic matching, but they typically treat entities as independent fragments, overlooking the multi-faceted nature of scientific concepts. To address this limitation, we propose Pairwise Semantic Matching (PairSem), a framework that represents relevant semantics as entity-aspect pairs, capturing complex, multi-faceted scientific concepts. PairSem is unsupervised, base retriever-agnostic, and plug-and-play, enabling precise and context-aware matching without requiring query-document labels or entity annotations. Extensive experiments on multiple datasets and retrievers demonstrate that PairSem significantly improves retrieval performance, highlighting the importance of modeling multi-aspect semantics in scientific information retrieval.
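
To make the contrast between entity-only and entity–aspect matching concrete, the toy sketch below scores two documents against a query in both ways. It is an illustration under stated assumptions, not the scoring function used by PairSem: the pairs are invented, the `jaccard` and `entities` helpers are hypothetical, and plain set overlap stands in for whatever matching the paper actually defines.

```python
# Toy illustration only: hypothetical data and a plain Jaccard overlap,
# not the scoring function used by PairSem.

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def entities(pairs: set) -> set:
    """Drop the aspect and keep only the entity names."""
    return {entity for entity, _aspect in pairs}

# Hypothetical (entity, aspect) pairs for one query and two documents.
query = {("graphene", "thermal conductivity")}
docs = {
    "doc_a": {("graphene", "thermal conductivity"),
              ("copper", "electrical conductivity")},
    "doc_b": {("graphene", "electrical conductivity"),
              ("copper", "thermal conductivity")},
}

# Entity-only matching: both documents mention the same entities.
entity_scores = {name: jaccard(entities(query), entities(pairs))
                 for name, pairs in docs.items()}

# Pair-level matching: only doc_a discusses the queried aspect of graphene.
pair_scores = {name: jaccard(query, pairs) for name, pairs in docs.items()}

print(entity_scores)
print(pair_scores)
```

In this example both documents tie under entity-only matching because they mention the same entities, whereas pair-level matching separates them: only `doc_a` attaches the queried aspect to the queried entity.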

Paper Structure

This paper contains 36 sections, 14 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Example of retrieval in the Chemistry domain. Relevant entities are highlighted in red and their associated aspects in blue. A detailed case study is presented in Table \ref{tab:casestudy} in the Appendix.
  • Figure 2: Overview of the Pairwise Semantic Matching (PairSem) framework. The processes in §4.1 are performed offline prior to query arrival, while only §4.2 is executed at inference time.
  • Figure 3: Time-accuracy trade-off on ChemLit-QA.
  • Figure 4: Number of unique aspects per entity in corpus.
  • Figure 5: Hyperparameter study on PairSem and PairSem_fast.