Hierarchical corpus encoder: Fusing generative retrieval and dense indices
Tongfei Chen, Ankita Sharma, Adam Pauls, Benjamin Van Durme
TL;DR
The paper addresses the challenge of scalable ad hoc retrieval that generalizes to unseen documents by fusing dense encoders with a structured document hierarchy. It introduces the Hierarchical Corpus Encoder (HCE), which jointly learns vector representations and a hierarchical index, and which is trained with a hierarchy-aware, multi-level contrastive loss that uses sibling nodes as negatives. Empirical results show that HCE outperforms both dense and generative baselines in supervised and unsupervised settings, with particular strengths at larger recall cutoffs and robustness to incremental updates. The work demonstrates that modeling the document set as a hierarchy yields zero-shot adaptability and efficient index updates, offering a practical path toward scalable, high-performance retrieval systems.
Abstract
Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, it is a challenge to support documents not seen during training. We find that the performance of generative retrieval stems from contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE achieves better results than generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing the easy addition and removal of documents to the index.
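
To make the idea of contrastive training between sibling nodes concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy two-level hierarchy, a dot-product score, and fixed node vectors; the names `sibling_contrastive_loss`, `children`, and `vectors` are hypothetical and chosen only for illustration. At each level of the path from the root to the relevant document, the positive node is contrasted against its siblings with a softmax cross-entropy term, and the terms are summed over levels.

```python
import math

def dot(u, v):
    """Dot-product score between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sibling_contrastive_loss(query, path, children, vectors, temperature=1.0):
    """Sum over hierarchy levels of -log softmax(positive node | its siblings).

    query:    query vector (list of floats)
    path:     node names from just below the root down to the relevant document
    children: dict mapping a parent node name to the list of its child names
    vectors:  dict mapping a node name to its vector (assumed given here)
    """
    loss = 0.0
    parent = "root"
    for node in path:
        siblings = children[parent]  # the positive node plus its sibling negatives
        scores = [dot(query, vectors[s]) / temperature for s in siblings]
        # Numerically stable log-sum-exp over the sibling scores.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss += log_z - scores[siblings.index(node)]
        parent = node  # descend one level
    return loss

# Toy hierarchy (illustrative values only): two clusters with two documents each.
children = {"root": ["A", "B"], "A": ["d1", "d2"], "B": ["d3", "d4"]}
vectors = {
    "A": [1.0, 0.0], "B": [0.0, 1.0],
    "d1": [1.0, 0.2], "d2": [0.8, -0.2],
    "d3": [0.1, 1.0], "d4": [-0.1, 0.9],
}
query = [1.0, 0.1]

loss_relevant = sibling_contrastive_loss(query, ["A", "d1"], children, vectors)
loss_irrelevant = sibling_contrastive_loss(query, ["B", "d3"], children, vectors)
```

In this toy setup the query points toward cluster A, so the loss along the path to d1 is lower than the loss along the path to d3. Because negatives at each level are drawn only from siblings, adding a document under this assumed layout amounts to attaching a new child to an existing node, without touching the rest of the hierarchy.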
