Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models

Seiji Maekawa; Hayate Iso; Sairam Gurajada; Nikita Bhutani

TL;DR

This work investigates when retrieval augmentation helps language models answer knowledge-intensive questions by introducing WiTQA, a fact-centric QA dataset that factors in entity and relation popularity. Through extensive zero-shot experiments with 10 LMs and multiple retrievers, the authors show that larger models recall popular facts well, while retrieval improves performance on long-tail facts and smaller models. They also demonstrate that indiscriminate augmentation can harm accuracy, and propose a selective memory integration strategy that uses entity-relation and entity frequencies to decide when to retrieve. The results highlight practical implications for deploying RALMs in real-world QA systems, offering thresholds and guidance to balance memory recall and external retrieval to maximize accuracy. Limitations include reliance on Wikipedia-based priors, limited prompt tuning, and focus on triple-based (single-hop) questions, suggesting avenues for future work on multi-hop reasoning and broader corpora.

Abstract

While large language models (LMs) demonstrate remarkable performance, they struggle to respond accurately when queried for information beyond what they memorized during pre-training. Although augmenting them with relevant external information can mitigate this issue, failing to consider whether retrieval is necessary may adversely affect overall performance. Previous research has primarily focused on examining how entities influence retrieval models and knowledge recall in LMs, leaving other aspects relatively unexplored. In this work, our goal is to offer a more detailed, fact-centric analysis by exploring the effects of combinations of entities and relations. To facilitate this, we construct a new question answering (QA) dataset called WiTQA (Wikipedia Triple Question Answers). This dataset includes questions about entities and relations of various popularity levels, each accompanied by a supporting passage. Our extensive experiments with diverse LMs and retrievers reveal, from the viewpoint of fact-centric popularity, when retrieval fails to consistently enhance LMs. Confirming earlier findings, we observe that larger LMs excel in recalling popular facts. However, they have notably more difficulty than retrievers with infrequent entity-relation pairs. Interestingly, they can effectively retain popular relations of less common entities. We demonstrate the efficacy of our finer-grained metric and insights through an adaptive retrieval system that selectively employs retrieval and recall based on the frequencies of entities and relations in the question.
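The adaptive retrieval idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the threshold values, the rule that retrieval is triggered when either count falls below its threshold, and the names `Thresholds`, `should_retrieve`, `answer`, `recall_fn`, and `retrieve_fn` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; the source text does not state concrete values.
    sr_count: int = 50    # minimum subject-relation (S-R) count to trust recall
    s_count: int = 500    # minimum subject (S) count to trust recall


def should_retrieve(sr_count: int, s_count: int,
                    thresholds: Thresholds = Thresholds()) -> bool:
    """Assumed rule: retrieve when either frequency is below its threshold."""
    return sr_count < thresholds.sr_count or s_count < thresholds.s_count


def answer(question: str, sr_count: int, s_count: int,
           recall_fn: Callable[[str], str],
           retrieve_fn: Callable[[str], str],
           thresholds: Thresholds = Thresholds()) -> str:
    """Route a question to plain recall or to retrieval-augmented recall."""
    if should_retrieve(sr_count, s_count, thresholds):
        # Long-tail fact: prepend a retrieved passage to the prompt.
        passage = retrieve_fn(question)
        return recall_fn(f"{passage}\n\n{question}")
    # Popular fact: rely on the LM's parametric memory alone.
    return recall_fn(question)
```

In this sketch, `recall_fn` stands in for a call to the LM and `retrieve_fn` for a call to the retriever; in practice the thresholds would have to be tuned per model, since the abstract indicates that recall ability depends on model size.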

Paper Structure (35 sections, 15 figures, 5 tables)

Introduction
Background
Open domain Question Answering
Parametric vs Non-Parametric Knowledge
The WiTQA dataset
Dataset creation
Triple extraction
Triple sampling
Answer candidate expansion
Question generation with roundtrip refinement
Dataset Statistics
Experiments: Recall or Retrieve
Setup
Analysis of Model's Recall Ability
When Do Retrievers Help
...and 20 more sections

Figures (15)

Figure 1: Overview of WiTQA dataset creation. First, we extract triples from Wikipedia and Wikidata, and compute the frequencies of subject-relation pairs and subject entities (referred to as S-R counts and S counts) (§ Triple extraction). Second, we sample triples based on different ranges of S-R counts and select supporting passages based on entailment scores (§ Triple sampling). Third, we expand answer candidates using Wikidata (§ Answer candidate expansion). Finally, we generate questions from triples and iteratively refine the generated questions (§ Question generation with roundtrip refinement).
Figure 2: Histograms of question distributions. WiTQA exhibits greater diversity than existing benchmarks regarding question popularity, as indicated by the variation in S-R counts.
Figure 3: We categorize the questions into bins based on their S-R counts and present LM accuracy across these bins. Shaded areas are the $95\%$ bootstrap confidence intervals with $1000$ samples. Larger models exhibit higher accuracy than smaller models. Even small models memorize factual knowledge about popular questions.
Figure 4: Accuracy over subject entity page views.
Figure 5: Accuracy over entity counts.
...and 10 more figures
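
The Figure 3 caption mentions 95% bootstrap confidence intervals computed with 1000 samples. A minimal sketch of how such an interval could be computed for the accuracy of a single S-R count bin follows. It assumes the percentile bootstrap and a list of per-question 0/1 correctness flags; the caption does not say which bootstrap variant was used, and the name `bootstrap_ci` is hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(correct: Sequence[int], n_resamples: int = 1000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean accuracy (assumed variant)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    # Resample the bin with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Example: one bin with 70 correct answers out of 100 questions.
flags = [1] * 70 + [0] * 30
low, high = bootstrap_ci(flags)
```

Repeating this per bin and shading the region between `low` and `high` would give a band of the kind the caption describes.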
