Generator-Retriever-Generator Approach for Open-Domain Question Answering
Abdelrahman Abdallah, Adam Jatowt
TL;DR
The paper addresses open-domain QA by integrating document generation and retrieval into a Generator-Retriever-Generator (GRG) pipeline. A first LLM generates contextual documents for a given question, while a dual-encoder retriever fetches relevant documents from an external corpus; a second LLM then produces the final answer conditioned on both sources. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), with EM gains of at least +5.2, +4.2, and +1.6 on TriviaQA, Natural Questions, and WebQ, respectively, which indicates that jointly leveraging generated and retrieved documents improves answer quality. The authors release code, datasets, and checkpoints and report ablation studies, supporting the approach's potential for scalable, high-precision open-domain QA in real-world settings.
Abstract
Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM) by first prompting the model to generate contextual documents based on a given question. In parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus. The generated and retrieved documents are then passed to a second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), improving their performance by at least +5.2, +4.2, and +1.6 on the TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at https://github.com/abdoelsayed2016/GRG.
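The three-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `embed`, `retrieve`, `grg_answer`, `stub_generator`, and `stub_reader` are hypothetical, the bag-of-words "encoder" merely stands in for a trained dual-encoder network, and the two LLMs are replaced by placeholder callables so that the example runs without any model.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy encoder (assumption): token counts instead of a trained neural encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def score(q: Counter, d: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: List[str], k: int) -> List[str]:
    """Retriever stage: rank corpus passages by similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(corpus, key=lambda doc: score(q_vec, embed(doc)), reverse=True)
    return ranked[:k]


def grg_answer(
    question: str,
    corpus: List[str],
    generate_docs: Callable[[str], List[str]],
    read: Callable[[str, List[str]], str],
    k: int = 2,
) -> str:
    """Generator -> Retriever -> Generator, with both LLMs passed in as callables."""
    generated = generate_docs(question)        # first LLM: contextual documents
    retrieved = retrieve(question, corpus, k)  # retriever: external documents
    return read(question, generated + retrieved)  # second LLM: final answer


def stub_generator(question: str) -> List[str]:
    """Placeholder for the first LLM (hypothetical)."""
    return ["Generated context for: " + question]


def stub_reader(question: str, docs: List[str]) -> str:
    """Placeholder for the second LLM (hypothetical); reports how many documents it saw."""
    return f"answer conditioned on {len(docs)} documents"


corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Berlin is the capital of Germany.",
]
question = "What is the capital of France?"
print(retrieve(question, corpus, 1))
print(grg_answer(question, corpus, stub_generator, stub_reader, k=2))
```

Passing the generator and reader as callables keeps the sketch independent of any particular model; in a real system each would wrap an LLM call, and `embed` would be replaced by separate question and passage encoders.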
