Distillation Enhanced Generative Retrieval
Yongqi Li, Zhen Zhang, Wenjie Wang, Liqiang Nie, Wenjie Li, Tat-Seng Chua
TL;DR
Distillation Enhanced Generative Retrieval (DGR) is a teacher-student framework that improves generative retrieval by distilling graded passage rankings from a powerful teacher into a generative retriever. It proposes a distilled RankNet loss that uses the teacher's ranking order as supervision, and it reports strong, robust gains on NQ, TriviaQA, MS MARCO, and TREC DL while leaving inference unchanged. The approach shows that knowledge distillation can help close the gap to dense retrieval within the generative paradigm, and that it remains effective across different teacher architectures and distillation losses. The work points to future directions, such as longer teacher rankings and more nuanced sampling strategies, to further boost performance.
Abstract
Generative retrieval is a promising new paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. This paradigm leverages powerful generative language models, in contrast to traditional sparse or dense retrieval methods. In this work, we identify distillation as a viable direction for further enhancing generative retrieval and propose a framework named DGR. DGR uses a sophisticated ranking model, such as a cross-encoder, as the teacher to supply a passage rank list that captures varying degrees of relevance rather than binary hard labels. DGR then optimizes the generative retrieval model with a specially designed distilled RankNet loss that treats the teacher's passage rank order as labels. The framework requires only an additional distillation step on top of current generative retrieval systems and adds no burden at inference. We conduct experiments on four public datasets, and the results indicate that DGR achieves state-of-the-art performance among generative retrieval methods. Additionally, DGR is robust and generalizes well across various teacher models and distillation losses.
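To make the idea of using the teacher's rank order as labels concrete, the following is a minimal sketch of a standard RankNet-style pairwise loss applied to a teacher-ordered list. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `distilled_ranknet_loss` is hypothetical, the student scores are assumed to be one real-valued score per passage (for example, a score derived from the generative model's likelihood of the passage identifier), and the averaging over pairs is a choice made here for readability.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def distilled_ranknet_loss(student_scores: Sequence[float]) -> float:
    """RankNet-style pairwise loss over a teacher-ordered passage list.

    Assumption: student_scores[k] is the student's score for the passage
    that the teacher ranked at position k (index 0 = most relevant).
    For every pair (i, j) with i ranked above j by the teacher, the loss
    log(1 + exp(s_j - s_i)) is small when the student agrees (s_i > s_j)
    and large when it disagrees.
    """
    n = len(student_scores)
    total, num_pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += softplus(student_scores[j] - student_scores[i])
            num_pairs += 1
    # A single passage yields no pairs, so there is nothing to penalize.
    return total / num_pairs if num_pairs else 0.0


if __name__ == "__main__":
    agrees = [3.0, 2.0, 1.0]     # student order matches the teacher
    disagrees = [1.0, 2.0, 3.0]  # student order is fully reversed
    print(distilled_ranknet_loss(agrees))
    print(distilled_ranknet_loss(disagrees))
```

In this sketch only the teacher's ordering matters, not the teacher's raw scores, which mirrors the abstract's description of rank order serving as the label. Because such a loss is applied only during training, it is consistent with the claim that inference is left unchanged.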
