Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du, Quanjie Han, Chao Lv, Yan Liu, Yalin Sun, Hao Shu, Hongbo Shan, Maosong Sun
TL;DR
BRMGR tackles the label-scarce challenge of merging retrieved and LLM-generated knowledge for open-domain QA by introducing an unsupervised bi-reranking framework. It scores retrieved passages with $p(\mathbf{q} \mid \mathbf{rp}_j)$ and generated passages with $p(\mathbf{lp}_i \mid \mathbf{q})$, then forms cross-source relevance as $p(\mathbf{lp}_i, \mathbf{rp}_j \mid \mathbf{q}) \propto p(\mathbf{lp}_i \mid \mathbf{q})\, p(\mathbf{q} \mid \mathbf{rp}_j)$ and applies greedy matching, which is equivalent to a bipartite matching loss under a product factorization. The method leverages zero-shot generation to estimate compatibility and avoids silver-label mining. Evaluations on TriviaQA, Natural Questions, and WebQuestions show consistent gains over single-source baselines and competitive results against strong baselines in both retrieval and QA, with ablations highlighting the importance of document-generation-based reranking and the choice of PLMs.
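The pairing step described above can be illustrated with a minimal sketch. It assumes the two sets of log-scores, $\log p(\mathbf{lp}_i \mid \mathbf{q})$ and $\log p(\mathbf{q} \mid \mathbf{rp}_j)$, have already been computed by some language model; the function name `greedy_match` and its argument names are hypothetical and do not come from the paper, whose exact procedure may differ.

```python
from typing import List, Tuple


def greedy_match(
    gen_logp: List[float],  # assumed input: log p(lp_i | q), one per generated passage
    ret_logp: List[float],  # assumed input: log p(q | rp_j), one per retrieved passage
) -> List[Tuple[int, int, float]]:
    """Greedily pair generated and retrieved passages by joint log-score.

    Returns (generated_index, retrieved_index, joint_log_score) triples,
    best pair first. Each passage is used at most once.
    """
    # Under the product factorization, the joint log-score of a pair is
    # log p(lp_i | q) + log p(q | rp_j).
    candidates = [
        (g + r, i, j)
        for i, g in enumerate(gen_logp)
        for j, r in enumerate(ret_logp)
    ]
    # Highest score first; ties are broken by lower indices for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_gen = set()
    used_ret = set()
    pairs = []
    for score, i, j in candidates:
        # Skip any pair that reuses an already matched passage.
        if i in used_gen or j in used_ret:
            continue
        used_gen.add(i)
        used_ret.add(j)
        pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    print(greedy_match([-2.0, -0.5, -1.0], [-0.25, -1.5]))
```

Because the joint score is a sum of one term per generated passage and one term per retrieved passage, this greedy procedure ends up pairing the k-th best generated passage with the k-th best retrieved passage, so in practice sorting each list independently would give the same pairs.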
Abstract
Open-domain Question Answering (QA) has garnered substantial interest by combining the advantages of faithfully retrieved passages and relevant passages generated by Large Language Models (LLMs). However, there are no definitive labels available for pairing these two sources of knowledge. To address this issue, we propose a simple, unsupervised framework called Bi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which applies re-ranking to both retrieved passages and LLM-generated passages. We rank the two types of passages with two separate re-ranking methods and then pair them through greedy matching. We demonstrate that BRMGR is equivalent to employing a bipartite matching loss when assigning each retrieved passage a corresponding LLM-generated passage. In experiments on three datasets, our model improves performance by +1.7 on NQ and +1.6 on WebQ, and obtains comparable results on TriviaQA, when compared to competitive baselines.
