From Retrieval to Generation: Comparing Different Approaches

Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt

TL;DR

The paper tackles the problem of balancing retrieval accuracy with generative flexibility in knowledge-intensive tasks by systematically comparing retrieval-based, generation-based, and hybrid models across open-domain QA, information retrieval, and language modeling. It evaluates classic sparse and dense retrievers, generator-based approaches, and hybrid architectures that integrate retrieved and generated content, complemented by reranking techniques like UPR and RankGPT. Key findings show dense retrievers (e.g., DPR, MSS-DPR) achieve strong ODQA performance, while generative models excel on certain datasets but can struggle with factuality; hybrids offer balanced improvements in QA and IR at the expense of higher compute and potential redundancy. The work provides practical guidance for deploying retrieval, reranking, and retrieval-augmented generation in real-world, knowledge-intensive applications and highlights the need for careful design to manage hallucination, scalability, and domain transfer.

Abstract

Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR) retrieve efficiently from large corpora but often lack semantic depth. Generative models like GPT-4o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity than generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.
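The abstract reports three metrics: top-k accuracy (ODQA), nDCG@10 (reranking), and perplexity (language modeling). The following is a minimal sketch of how each is commonly computed, not the paper's evaluation code. The function names are hypothetical, and three simplifications are assumed: substring matching for answer hits, linear gain for nDCG, and an ideal ranking built only from the judged list that is passed in.

```python
import math


def top_k_hit(passages, answers, k=1):
    """Return True if any of the top-k passages contains a gold answer.

    Assumption: case-insensitive substring matching; real ODQA evaluation
    usually applies stricter answer normalization.
    """
    top = [p.lower() for p in passages[:k]]
    return any(a.lower() in p for a in answers for p in top)


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """nDCG@k for relevance labels listed in ranked order.

    Assumption: the ideal ranking is the same labels sorted descending, so
    relevant documents missing from the list are not counted.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Top-1 accuracy over a dataset is then the mean of `top_k_hit(..., k=1)` across questions, and a benchmark-level nDCG@10 is typically an average of per-query scores.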

Paper Structure

This paper contains 20 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of experimental setup across the three tasks of open-domain question answering (QA), Information Retrieval, and Language Modeling.
  • Figure 2: Perplexity Comparison for Language Modeling with Retrieval, Generation, and Hybrid Context Strategies using Different Generator Documents. The figure is divided into two subplots, each representing a different document generator used for providing context: (a) Llama-3 70B Generator Document and (b) GPT3.5 Generator Document. In both subplots, language model perplexity is evaluated under several context strategies: 'No Context' (baseline), 'R' (Retrieval-only using BM25 from Wikipedia), 'G' (Generation-only, context from a generated document), 'R+G' (Retrieval followed by Generation), and 'G+R' (Generation followed by Retrieval).
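
The five context strategies named in the Figure 2 caption can be sketched as a small helper that decides which documents are prepended to the evaluated text, and in what order. This is an illustrative sketch only: `build_context` and `build_prompt` are hypothetical names, and reading 'R+G' as "retrieved passage first, generated document second" (and 'G+R' as the reverse) is an assumption about the caption, not a confirmed detail of the paper's setup.

```python
def build_context(strategy, retrieved, generated, sep="\n\n"):
    """Assemble the context string for one of the Figure 2 strategies.

    Assumption: 'R+G' and 'G+R' differ only in concatenation order.
    """
    orders = {
        "none": [],                      # 'No Context' baseline
        "R": [retrieved],                # retrieval-only (e.g. a BM25 passage)
        "G": [generated],                # generation-only
        "R+G": [retrieved, generated],   # retrieved text, then generated text
        "G+R": [generated, retrieved],   # generated text, then retrieved text
    }
    if strategy not in orders:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sep.join(orders[strategy])


def build_prompt(strategy, retrieved, generated, text):
    """Prepend the chosen context to the text whose perplexity is measured."""
    context = build_context(strategy, retrieved, generated)
    return f"{context}\n\n{text}" if context else text
```

In a setup like this, perplexity would normally be computed only over the tokens of `text`, so the strategies are compared on the same target tokens.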