Generative Retrieval for Book search

Yubao Tang; Ruqing Zhang; Jiafeng Guo; Maarten de Rijke; Shihao Liu; Shuaiqing Wang; Dawei Yin; Xueqi Cheng

TL;DR

This paper applies generative retrieval to book search, using data augmentation and outline-oriented encoding to exploit full book information (metadata, outlines, and main text). It introduces GBS, a transformer-based framework that combines coverage-promoting and diversity-enhanced augmentation with outline-aware encoding, which uses bi-level positional encoding and retentive attention to handle long, hierarchical content. The model is trained with multi-task objectives for indexing and retrieval and evaluated on Baidu’s dataset and public data, where GBS substantially outperforms state-of-the-art GR baselines, including RIPOR, especially at long input lengths. The work demonstrates significant practical gains for book search and outlines avenues for reducing training costs and scaling to larger backbones.

Abstract

In book search, relevant book information should be returned in response to a query. Books contain complex, multi-faceted information such as metadata, outlines, and main text, where the outline provides hierarchical information between chapters and sections. Generative retrieval (GR) is a new retrieval paradigm that consolidates corpus information into a single model to generate identifiers of documents that are relevant to a given query. How can GR be applied to book search? Directly applying GR to book search is challenging because of two characteristics of the task. First, the model needs to retain the complex, multi-faceted information of the book, which increases the demand for labeled data. Second, splitting book information and treating it as a collection of separate segments for learning might result in a loss of hierarchical information. We propose an effective Generative retrieval framework for Book Search (GBS) that features two main components: data augmentation and outline-oriented book encoding. For data augmentation, GBS constructs multiple query-book pairs for training; it constructs multiple book identifiers based on the outline and various forms of book contents, and simulates real book retrieval scenarios with varied pseudo-queries. This includes coverage-promoting book identifier augmentation, allowing the model to learn to index effectively, and diversity-enhanced query augmentation, allowing the model to learn to retrieve effectively. Outline-oriented book encoding improves length extrapolation through bi-level positional encoding and retentive attention mechanisms to maintain context over long sequences. Experiments on a proprietary Baidu dataset demonstrate that GBS outperforms strong baselines, achieving a 9.8% improvement in terms of MRR@20 over the state-of-the-art RIPOR method...
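As a rough illustration of the bi-level positional idea mentioned in the abstract, the sketch below assigns each token a pair of indices: the outline section it belongs to, and its offset inside that section. This is an assumption about what "bi-level" could look like, not the paper's actual formulation; the function name `bi_level_positions` and the `section_lengths` input are hypothetical.

```python
from typing import List, Tuple


def bi_level_positions(section_lengths: List[int]) -> List[Tuple[int, int]]:
    """Return (section_index, offset_within_section) for every token.

    Hypothetical helper: `section_lengths` holds the number of tokens in
    each outline section, in reading order.
    """
    positions: List[Tuple[int, int]] = []
    for section_index, length in enumerate(section_lengths):
        for offset in range(length):
            positions.append((section_index, offset))
    return positions


# A toy book with two sections of 3 and 2 tokens.
print(bi_level_positions([3, 2]))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Under this assumed scheme the within-section offset is bounded by the longest section rather than by the length of the whole book, which is one plausible reason a two-level scheme could help with long inputs.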

Paper Structure (25 sections, 5 equations, 4 figures, 5 tables)

Introduction
Methodology
Problem statement
Model architecture
Data augmentation
Coverage-promoting book identifier augmentation for indexing
Diversity-enhanced query augmentation for retrieval
Outline-oriented book encoding
Outline-oriented bi-level positional encoding
Outline-oriented retentive attention
Training
Inference
GBS
Experimental Settings
Implementation details
...and 10 more sections

Figures (4)

Figure 1: Books mainly consist of three types of information: (1) metadata, which includes details like the title, author, and publisher; (2) the main text, which constitutes the core content of the book; and (3) the outline, which shows the hierarchical structure and relationships between the chapters and sections.
Figure 2: Based on an encoder-decoder architecture, GBS comprises two components: (1) Data augmentation (orange dashed rectangles), which includes coverage-promoting book identifier augmentation for indexing and diversity-enhanced query augmentation for retrieval, generating multiple data pairs. (2) Outline-oriented book encoding, which includes outline-oriented bi-level positional encoding (green dashed rectangles) and outline-oriented retentive attention (blue dashed rectangles), to encode the long book contents based on hierarchical information. (The figure should be viewed in color.)
Figure 3: The performance, in terms of Hits@10, of GBS$^P$ and RIPOR with different input lengths on the BBS 40K dataset.
Figure 4: The performance, in terms of Hits@10, of GBS$^P$ with different numbers of diversity-enhanced pseudo-queries, i.e., $X$, on the BBS 40K dataset.
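
The two metrics named on this page, Hits@10 (Figures 3 and 4) and MRR@20 (abstract), can be sketched with their standard definitions as follows. The paper may compute them with its own variant; the book identifiers and the helper names `hits_at_k`, `reciprocal_rank_at_k`, and `mean_over_queries` are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple


def hits_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1.0 if any relevant book appears in the top-k results, else 0.0."""
    return 1.0 if any(book in relevant_ids for book in ranked_ids[:k]) else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1/rank of the first relevant book within the top-k, else 0.0."""
    for rank, book in enumerate(ranked_ids[:k], start=1):
        if book in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_over_queries(
    metric: Callable[[Sequence[str], Set[str], int], float],
    runs: List[Tuple[Sequence[str], Set[str]]],
    k: int,
) -> float:
    """Average a per-query metric over all (ranking, relevant set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two toy queries: the first finds its book at rank 2, the second misses.
runs = [(["b3", "b1", "b7"], {"b1"}), (["b2", "b9"], {"b5"})]
print(mean_over_queries(hits_at_k, runs, 10))             # 0.5
print(mean_over_queries(reciprocal_rank_at_k, runs, 20))  # 0.25
```

Hits@k only records whether a relevant book was retrieved at all, while MRR@k also rewards placing it near the top of the ranking.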
