Hierarchical corpus encoder: Fusing generative retrieval and dense indices

Tongfei Chen, Ankita Sharma, Adam Pauls, Benjamin Van Durme

TL;DR

The paper addresses the challenge of scalable ad hoc retrieval that generalizes to unseen documents by fusing dense encoders with a structured document hierarchy. It introduces the Hierarchical Corpus Encoder (HCE), which jointly learns vector representations and a hierarchical index and is trained with a hierarchy-aware, multi-level contrastive loss that uses sibling nodes as negatives. Empirical results show that HCE outperforms both dense and generative baselines in supervised and unsupervised settings, with particular strength at larger recall cutoffs and robustness to incremental updates. The work demonstrates that modeling the document set as a hierarchy yields zero-shot adaptability and efficient index updates, offering a practical path toward scalable, high-performance retrieval systems.

Abstract

Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, supporting documents not seen during training remains a challenge. We identify that the performance of generative retrieval stems from contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE achieves better results than generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing easy addition of documents to, and removal from, the index.
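To make the idea of contrastive training between sibling nodes concrete, the following is a minimal sketch, not the paper's actual loss. It assumes a toy hierarchy in which every node carries a vector (for internal nodes this could be a cluster centroid, which is an assumption made here), scores nodes by dot product with the query, and sums one softmax cross-entropy term per level, where the negatives at each level are the siblings of the node on the path to the gold document. The names `sibling_contrastive_loss`, `children`, and `embed` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax_at(scores, idx):
    """Log-probability of scores[idx] under a softmax, computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_z


def sibling_contrastive_loss(query, path, children, embed):
    """Sum of per-level cross-entropy terms along the path to the gold leaf.

    query    : query vector
    path     : node ids from the root to the gold leaf, e.g. ["root", "A", "A1"]
    children : dict mapping a node id to the list of its child ids
    embed    : dict mapping a node id to its vector (assumption: internal
               nodes also have vectors, e.g. centroids of their subtrees)
    """
    loss = 0.0
    for parent, gold in zip(path, path[1:]):
        candidates = children[parent]  # the gold node together with its siblings
        scores = [dot(query, embed[n]) for n in candidates]
        loss -= log_softmax_at(scores, candidates.index(gold))
    return loss


# Toy hierarchy of depth 2 with branching factor 2 (illustrative values only).
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
embed = {
    "A": [1.0, 0.0], "B": [0.0, 1.0],
    "A1": [1.0, 0.5], "A2": [1.0, -0.5],
    "B1": [0.5, 1.0], "B2": [-0.5, 1.0],
}
query = [1.0, 0.5]

loss_gold = sibling_contrastive_loss(query, ["root", "A", "A1"], children, embed)
loss_wrong = sibling_contrastive_loss(query, ["root", "B", "B2"], children, embed)
print(loss_gold < loss_wrong)
```

In this sketch, adding a new document only requires placing its vector under an existing parent in `children` and `embed`; no per-document output parameters are involved, which is the property that dense encoders bring to the hierarchy.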

Paper Structure

This paper contains 37 sections, 8 equations, 7 figures, 5 tables, 3 algorithms.

Figures (7)

  • Figure 1: Left: DSI with atomic IDs. Right: DSI with a document hierarchy.
  • Figure 2: A document hierarchy with depth 3.
  • Figure 3: Illustration of HCE training, where the query is contrasted with tiered negative samples. The query and documents shown are taken from the NQ320k dataset (Kwiatkowski et al., 2019).
  • Figure 4: Recall@$\{1,5,10,100\}$ for various branching factors $b$ under NQ320k.
  • Figure 5: Recall@$\{1,10\}$ for incremental updates on NQ320k.
  • ...and 2 more figures