A Survey of Generative Information Retrieval

Tzu-Lin Kuo, Tzu-Wei Chiu, Tzung-Sheng Lin, Sheng-Yang Wu, Chao-Wei Huang, Yun-Nung Chen

TL;DR

This survey addresses the shift from traditional IR to Generative Retrieval (GR), where a seq2seq model directly maps user queries to document identifiers (DocIDs) without explicit query processing or reranking. It defines GR, surveys indexing and retrieval mechanisms, and classifies DocID strategies into numerical and string-based types, highlighting that semantically informed identifiers typically yield stronger retrieval signals. The paper analyzes evaluation frameworks, baselines, and dataset usage (e.g., MS MARCO, Natural Questions), and discusses scalability and dynamic-corpus challenges, proposing learnable DocIDs, higher-quality query generation, and decoder-based LLMs as promising directions. It concludes with actionable future directions in training methods, indexing scalability, and multi-task learning to advance GR's practicality and performance in real-world information retrieval systems.

Abstract

Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to directly map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking. This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges. We discuss various document identifier strategies, including numerical and string-based identifiers, and explore different document representation methods. Our primary contribution lies in outlining future research directions that could profoundly impact the field: improving the quality of query generation, exploring learnable document identifiers, enhancing scalability, and integrating GR with multi-task learning frameworks. By examining state-of-the-art GR techniques and their applications, this survey aims to provide a foundational understanding of GR and inspire further innovations in this transformative approach to information retrieval. We also make complementary materials, such as the paper collection, publicly available at https://github.com/MiuLab/GenIR-Survey/

Paper Structure (29 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Progression of information retrieval from sparse vector similarity techniques, such as the bag-of-words and Vector Space Model, to dense retrieval with innovations like Word2Vec and BERT, culminating in sophisticated systems like DPR. Advances in generative retrieval now integrate language models for direct response generation.
  • Figure 2: The Generative Retrieval system consists of two primary stages. In the Indexing Stage, queries such as "What is O11y?" and "Who founded Google?", together with their corresponding documents, are mapped to DocIDs (258 and 147, respectively) by a seq2seq model, which learns both query-to-DocID and document-to-DocID associations. In the Retrieval Stage, the model takes a user query ("Who founded Google?") and autoregressively outputs the relevant DocID, eliminating the need for additional query processing and document reranking. This direct mapping enables end-to-end retrieval based on the learned associations.
  • Figure 3: Different types of document identifiers. We categorize DocIDs into two types: numerical identifiers and string identifiers. Numerical identifiers use numbers as identifiers and are further classified into single-token and sequential-token identifiers, based on the number of tokens used to represent each DocID. With sequential tokens, the model decodes each DocID one token at a time. Depending on the method used to create the hierarchical structure, sequential-token identifiers can be further divided into arbitrarily structured and semantically structured identifiers. With string identifiers, the model directly decodes strings as DocIDs. Based on the type of string used, we divide them into subset of strings, titles or URLs, and pseudo queries.
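
The token-by-token decoding of sequential numerical DocIDs described in the captions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the survey or of any specific system: restricting each decoding step to a prefix trie of valid DocIDs is one common way to guarantee that the output is an existing identifier, and it is an addition here. The names `build_trie`, `constrained_greedy_decode`, and `toy_score` are hypothetical, the DocIDs 258 and 147 are taken from the Figure 2 example, splitting them into digit tokens is an assumption, and `toy_score` is only a stand-in for the next-token scores a trained seq2seq model would produce.

```python
from typing import Callable, Dict, List, Sequence

# DocIDs from the Figure 2 example; treating each digit as one token
# is an assumption made purely for illustration.
DOCIDS = ["258", "147"]
EOS = "<eos>"  # hypothetical end-of-identifier marker


def build_trie(docids: Sequence[str]) -> Dict[str, dict]:
    """Build a prefix trie whose root-to-EOS paths are exactly the valid DocIDs."""
    root: Dict[str, dict] = {}
    for docid in docids:
        node = root
        for token in docid:
            node = node.setdefault(token, {})
        node[EOS] = {}  # marks a complete DocID
    return root


def constrained_greedy_decode(
    trie: Dict[str, dict],
    score: Callable[[List[str], str], float],
) -> str:
    """Greedily decode one DocID, considering only tokens the trie allows."""
    node = trie
    prefix: List[str] = []
    while True:
        allowed = list(node.keys())
        # Pick the highest-scoring token among the valid continuations.
        best = max(allowed, key=lambda tok: score(prefix, tok))
        if best == EOS:
            return "".join(prefix)
        prefix.append(best)
        node = node[best]


def toy_score(prefix: List[str], token: str) -> float:
    """Stand-in for a model's next-token score; it favors the digits of 147."""
    return 1.0 if token in {"1", "4", "7"} else 0.0


trie = build_trie(DOCIDS)
retrieved = constrained_greedy_decode(trie, toy_score)
print(retrieved)
```

In a real system the scorer would be conditioned on the user query, and beam search would usually replace the greedy choice so that several candidate DocIDs can be ranked. The trie only enforces validity; it does not by itself make identifiers semantically structured.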