A Survey of Long-Document Retrieval in the PLM and LLM Era

Minghan Li, Miyang Luo, Tianrui Lv, Yishuai Zhang, Siqi Zhao, Ercong Nie, Guodong Zhou

TL;DR

Long-Document Retrieval (LDR) is the task of locating precise information within very long texts, and its methods have progressed from lexical and early neural models to PLMs and LLMs. The survey organizes approaches into four complementary paradigms (Holistic modeling, Divide-and-Conquer processing, Indexing-Structure innovations, and Long-Query retrieval) and also covers efficiency techniques, domain applications, and evaluation resources. It emphasizes practical open problems such as efficiency, faithfulness, and robustness, and proposes a forward-looking agenda of hybrid systems that combine indexing, sparse long-context attention, and LLM reasoning. The work offers guidance for building scalable, domain-aware LDR systems and situates them within real-world applications such as legal discovery, biomedical literature search, and cross-lingual retrieval, with the aim of supporting robust evidence synthesis at scale.

Abstract

The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained language models (PLMs) and large language models (LLMs), covering key paradigms such as passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications and specialized evaluation resources, and we outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.

Paper Structure

This paper contains 41 sections, 12 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A comparison of general retrieval methods and long-document retrieval methods on queries targeting long documents. General retrieval methods struggle to acquire and present detailed, scattered information within long documents, whereas long-document retrieval methods excel at retrieving and organizing in-depth content from large volumes of text.
  • Figure 2: A structured taxonomy of Long-Document Retrieval, categorizing existing research across eras, core paradigms, applications, and evaluation methods.
  • Figure 3: An overview of the long-document retrieval paradigms in the PLM and LLM era. Methods evolve from (1) the Holistic Paradigm in the PLM & LLM Era to (2) the Divide-and-Conquer Paradigm for Long Documents and (3) Long-Query Retrieval, reflecting the field’s progression in balancing effectiveness, efficiency, and scalability.
  • Figure 4: A typical workflow for the key block selection approach within the divide-and-conquer paradigm, exemplified by models like KeyB. This approach concatenates the text of selected blocks before a final reranking.
  • Figure 5: Conceptual overview of three indexing-structure-oriented paradigms: (a) MC-indexing, (b) HELD, and (c) RAPTOR.
  • ...and 1 more figure
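
The key block selection workflow described in the Figure 4 caption (split the document into blocks, score each block against the query, keep the best ones, and concatenate their text before a final reranking) can be illustrated with a minimal sketch. This is not the KeyB implementation: the function names, the word-based block size, and the simple term-overlap scorer are all assumptions made for illustration, and the final reranker is left out.

```python
from collections import Counter


def split_into_blocks(text: str, block_size: int = 8) -> list[str]:
    """Split a document into consecutive blocks of at most `block_size` words."""
    words = text.split()
    return [" ".join(words[i:i + block_size]) for i in range(0, len(words), block_size)]


def overlap_score(query: str, block: str) -> int:
    """Hypothetical scorer: count block tokens that match a query term.

    A stand-in for whatever block scorer a real system would use.
    """
    query_terms = set(query.lower().split())
    block_counts = Counter(block.lower().split())
    return sum(block_counts[term] for term in query_terms)


def select_key_blocks(query: str, document: str, block_size: int = 8, top_k: int = 2) -> str:
    """Keep the `top_k` highest-scoring blocks and concatenate them in document order."""
    blocks = split_into_blocks(document, block_size)
    # Rank block indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(blocks)), key=lambda i: overlap_score(query, blocks[i]), reverse=True)
    # Restore the original order so the concatenated text stays readable.
    kept = sorted(ranked[:top_k])
    return " ".join(blocks[i] for i in kept)


if __name__ == "__main__":
    doc = (
        "retrieval models rank documents weather today is sunny "
        "long documents need retrieval bananas are yellow fruit"
    )
    # The concatenated text would then be passed, with the query, to a reranker.
    print(select_key_blocks("long documents retrieval", doc, block_size=4, top_k=2))
```

Concatenating the selected blocks in their original order keeps the shortened input coherent for the downstream reranker; the number of kept blocks would in practice be bounded by that reranker's input limit.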