Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge

Heydar Soudani, Evangelos Kanoulas, Faegheh Hasibi

TL;DR

The paper compares retrieval augmented generation (RAG) and fine-tuning (FT) as ways of injecting less-popular factual knowledge into language models. Experiments cover twelve LMs and multiple data augmentation and retrieval configurations. They show that RAG outperforms FT by a large margin on the least popular knowledge, while FT improves performance across popularity levels and is beneficial for small LMs. A key contribution is Stimulus RAG (SRAG), a lightweight RAG variant that prepends a hint sentence, extracted from the retrieved documents, to the prompt and surpasses FT-based approaches without costly fine-tuning. The findings offer practical guidance for domain-specific QA in low-resource settings and show that both approaches benefit from better retrieval and data augmentation.

Abstract

Language Models (LMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that their performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain-specific applications. The two prominent approaches to enhancing the performance of LMs on low-frequency topics are Retrieval Augmented Generation (RAG) and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LMs to handle low-frequency entities in question answering tasks. We conduct extensive experiments on twelve LMs of varying size and type and with different fine-tuning, data augmentation, and retrieval models. Our findings indicate that while FT boosts performance across entities of varying popularity, RAG surpasses FT by a large margin, particularly for the least popular factual knowledge. Additionally, the success of both RAG and FT approaches is amplified by improving retrieval and data augmentation techniques. Fine-tuning, while beneficial for small LMs, requires extensive resources. To address this issue, we propose the new Stimulus RAG approach, which surpasses the effectiveness of fine-tuning-based approaches, thereby eliminating the need for the costly data augmentation and fine-tuning steps for enriching LMs with less popular factual knowledge. The code is available at https://github.com/informagi/RAGvsFT.

Paper Structure (12 sections, 4 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison of RAG and fine-tuning on StableLM2 performance in question answering over factual knowledge. RAG-based approaches significantly enhance the performance of the vanilla StableLM2, outperforming fine-tuning by a large margin. Our proposed SRAG approach outperforms all models, including the fine-tuning based approaches.
  • Figure 2: Overview of parametric and non-parametric knowledge injection for less popular factual knowledge. First, we prepare the corpus. Next, we generate knowledge in two formats: textual documents and synthetic QA pairs. Finally, we inject the knowledge into the prompt or LM parameters.
  • Figure 3: Input prompt for prompt-based QA pair generation. We define a CoT prompt to outline the generation steps.
  • Figure 4: Our proposed Stimulus RAG method. The Hint Extractor identifies the most relevant sentence from top-K documents ranked by the retriever. This sentence is then added to the beginning of the input prompt.
  • Figure 5: Distribution of sample counts across popularity buckets, defined by $\log_{10}(\text{pageviews})$ for PopQA and WitQA and $\log_{2}(\text{pageviews})$ for EQ.
  • ...and 3 more figures
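
The Stimulus RAG flow described in the Figure 4 caption (pick the most relevant sentence from the top-K retrieved documents, then place it at the beginning of the prompt) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the function names, the prompt template, the naive sentence splitter, and above all the token-overlap scorer, which is only a simple stand-in for whatever relevance model the Hint Extractor actually uses.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens, used only by the toy scorer below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question: str, sentence: str) -> float:
    """Stand-in relevance score: fraction of question tokens found in the sentence."""
    q_tokens = tokenize(question)
    s_tokens = tokenize(sentence)
    if not q_tokens or not s_tokens:
        return 0.0
    return len(q_tokens & s_tokens) / len(q_tokens)


def extract_hint(question: str, top_k_docs: List[str]) -> str:
    """Return the single highest-scoring sentence across the top-K documents."""
    best_sentence, best_score = "", -1.0
    for doc in top_k_docs:
        for sentence in split_sentences(doc):
            score = overlap_score(question, sentence)
            if score > best_score:
                best_sentence, best_score = sentence, score
    return best_sentence


def build_srag_prompt(question: str, top_k_docs: List[str]) -> str:
    """Place the hint first, then the retrieved context and the question.

    The template is hypothetical; only the 'hint at the beginning' ordering
    comes from the Figure 4 caption.
    """
    hint = extract_hint(question, top_k_docs)
    context = "\n".join(top_k_docs)
    return f"Hint: {hint}\nContext:\n{context}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Made-up documents standing in for retriever output.
    docs = [
        "Alice Example was born in Utrecht. She studied physics.",
        "Bob Sample is a painter. He lives in Delft.",
    ]
    print(build_srag_prompt("Where was Alice Example born?", docs))
```

Replacing `overlap_score` with a stronger sentence-level relevance model changes only one function; the prompt assembly stays the same, which is what keeps this style of approach lightweight compared with fine-tuning.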