DocReLM: Mastering Document Retrieval with Language Model
Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang
TL;DR
DocReLM addresses semantic document retrieval in massive academic corpora by using large language models (LLMs) in two roles: to generate task-specific training data for the retriever and reranker, and to act as a search agent that exploits reference relationships between papers. The system comprises a dense retriever trained with contrastive learning on LLM-generated pseudo-queries, a cross-encoder reranker, and a reference extractor that expands results by extracting candidate papers from cited references. Evaluations on quantum physics and computer vision benchmarks show substantial improvements in top-k metrics over baselines such as Google Scholar in both domains. The approach demonstrates the practical potential of integrating LLMs with retrieval systems to enable context-aware, citation-network-guided search, and suggests avenues for iterative, multi-hop exploration across references.
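The three-stage flow described above (retrieve, rerank, expand through references) can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the `Paper` class, the function names, and the parameters `k` and `top_n` are hypothetical, and simple token-overlap scores stand in for the trained dense retriever, the cross-encoder reranker, and the LLM-based reference extractor.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record type; the fields are assumptions for this sketch.
    paper_id: str
    text: str
    references: list = field(default_factory=list)


def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(query, corpus, k):
    """Stage 1 stand-in: Jaccard token overlap replaces the dense retriever."""
    q = tokens(query)

    def score(paper):
        t = tokens(paper.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    ranked = sorted(corpus.values(), key=score, reverse=True)
    return [p.paper_id for p in ranked[:k]]


def rerank(query, candidate_ids, corpus):
    """Stage 2 stand-in: query-token coverage replaces the cross-encoder."""
    q = tokens(query)

    def score(pid):
        return len(q & tokens(corpus[pid].text)) / len(q) if q else 0.0

    return sorted(candidate_ids, key=score, reverse=True)


def expand_with_references(ranked_ids, corpus, top_n):
    """Stage 3 stand-in: append every reference of the top_n results.

    In the described system an LLM selects candidates from the references;
    here all known references are taken, which is a simplification.
    """
    expanded = list(ranked_ids)
    for pid in ranked_ids[:top_n]:
        for ref in corpus[pid].references:
            if ref in corpus and ref not in expanded:
                expanded.append(ref)
    return expanded


def search(query, corpus, k=2, top_n=1):
    """Run the three stages in order and return a list of paper ids."""
    candidates = retrieve(query, corpus, k)
    ranked = rerank(query, candidates, corpus)
    return expand_with_references(ranked, corpus, top_n)
```

The point of the third stage is that a relevant paper may share little vocabulary with the query and still be reachable through the reference list of a paper that does match.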
Abstract
With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further improve performance. We evaluate our system on a test set annotated by academic researchers in the fields of quantum physics and computer vision. The results show that DocReLM achieves a Top 10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and 36.21% in quantum physics, compared to Google Scholar's 12.96%.
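To make the reported metric concrete, a Top-k accuracy computation might look like the sketch below. It assumes that a query counts as a hit when at least one annotated relevant paper appears among its first k results; the abstract does not spell out the exact definition, so this criterion, the function name `top_k_accuracy`, and the input layout are all assumptions.

```python
def top_k_accuracy(results, relevant, k=10):
    """Fraction of queries with at least one relevant paper in the top k.

    results:  dict mapping query id -> ranked list of paper ids
    relevant: dict mapping query id -> set of annotated relevant paper ids
    Both layouts are hypothetical choices for this sketch.
    """
    if not results:
        raise ValueError("results must contain at least one query")
    hits = 0
    for query_id, ranked in results.items():
        gold = relevant.get(query_id, set())
        # Only the first k entries of the ranking are considered.
        if any(paper_id in gold for paper_id in ranked[:k]):
            hits += 1
    return hits / len(results)
```

Under this definition, a Top 10 accuracy of 44.12% would mean that for about 44 out of every 100 test queries, at least one annotated paper appears in the first ten results.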
