DocReLM: Mastering Document Retrieval with Language Model

Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang

TL;DR

DocReLM addresses semantic document retrieval in massive academic corpora by using large language models (LLMs) in two roles: generating task-specific training data for the retriever and reranker, and acting as a search agent that exploits reference relationships between papers. The system comprises a dense retriever trained with contrastive learning on LLM-generated pseudo-queries, a cross-encoder reranker, and a reference extractor that expands the results with candidate papers drawn from the references of retrieved papers. On quantum physics and computer vision benchmarks, DocReLM substantially outperforms baselines such as Google Scholar, reaching a Top 10 accuracy of 44.12% versus 15.69% in computer vision and 36.21% versus 12.96% in quantum physics. The approach demonstrates the practical potential of integrating LLMs with retrieval systems to enable context-aware, citation-network-guided search, and it suggests avenues for iterative, multi-hop exploration across references.

Abstract

With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance performance. We evaluate our system on a test set annotated by academic researchers in the fields of quantum physics and computer vision. The results show that DocReLM achieves a Top 10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and 36.21% in quantum physics, compared to Google Scholar's 12.96%.
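
The abstract reports Top 10 accuracy but does not define it here. The sketch below shows one common reading, stated as an assumption: a query counts as a hit when at least one annotated relevant document appears among the first k results, and the accuracy is the fraction of queries that are hits. The function name `top_k_accuracy`, the document IDs, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    Assumption: this hit-based definition is one common reading of
    "Top k accuracy"; the paper's exact definition may differ.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        # A query is a hit if any of its first k results is annotated relevant.
        if any(doc_id in gold for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Toy data (hypothetical): two queries, three ranked results each.
rankings = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

acc_at_3 = top_k_accuracy(rankings, relevant, k=3)  # q1 hits at rank 3, q2 misses
acc_at_2 = top_k_accuracy(rankings, relevant, k=2)  # neither query hits
```

Under this reading, widening k can only keep or raise the score, which is why systems are usually compared at the same cutoff.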

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Accuracy of the final system
  • Figure 2: The training and inference of DocReLM. The LLMs are used in both training and inference.
  • Figure 3: An example of reference extraction. A user enters a query. The retriever first returns $n$ candidate passages from the entire corpus, and the reranker then sorts these passages. Finally, the reference extractor reads the top 10 papers and extracts three references from each paper, which are inserted after the original paper.
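
The insertion step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_refs` is a hypothetical stand-in for the LLM-based reference extractor, the paper IDs are invented, and skipping papers that already appear in the list is an added assumption that the caption does not mention.

```python
from typing import Callable, List, Set


def expand_with_references(
    ranked_ids: List[str],
    extract_refs: Callable[[str], List[str]],
    top_n: int = 10,
    refs_per_paper: int = 3,
) -> List[str]:
    """Insert extracted references directly after each of the top_n papers.

    `extract_refs` is a hypothetical callable standing in for the LLM
    reference extractor; it returns candidate reference IDs for a paper.
    """
    expanded: List[str] = []
    seen: Set[str] = set()

    def add(paper_id: str) -> None:
        # Assumption: a paper is listed once, at its first position.
        if paper_id not in seen:
            seen.add(paper_id)
            expanded.append(paper_id)

    for position, paper_id in enumerate(ranked_ids):
        add(paper_id)
        if position < top_n:
            # Keep at most refs_per_paper references for this paper.
            for ref_id in extract_refs(paper_id)[:refs_per_paper]:
                add(ref_id)
    return expanded


# Toy example (hypothetical IDs): only the top 2 reranked papers are expanded.
toy_refs = {"A": ["R1", "R2", "R3", "R4"], "B": ["R1", "R5"]}
expanded = expand_with_references(
    ["A", "B", "C"],
    lambda paper_id: toy_refs.get(paper_id, []),
    top_n=2,
    refs_per_paper=3,
)
```

In the toy run, paper A contributes its first three references, paper B contributes only R5 because R1 is already listed, and paper C is left unexpanded because it falls outside the top 2.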