A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu

TL;DR

The paper asks whether independent, joint, and two-phase fine-tuning strategies for retrieval-augmented generation (RAG) yield different end-to-end gains. In controlled experiments across four RAG pipelines on HotPotQA and PopQA, all strategies achieve similar improvements in the end-to-end EM and F1 metrics but differ in computational cost and labeling requirements. The practical guidance: use independent fine-tuning when the training data includes context labels; otherwise, choose between joint and two-phase fine-tuning depending on whether a grid search over the embedding and generator learning rates is needed. Overall, the work clarifies the trade-off between compute and performance and indicates when each strategy is advantageous.

Abstract

Published: 20 Aug 2025

Retrieval-augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude that the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.
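
The EM and F1 generation quality metrics mentioned above are commonly computed per question by comparing a generated answer against a reference answer. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles, then token overlap); this summary does not state the exact normalization the authors used, so the function names and normalization steps here are assumptions for illustration only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the paper summary.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are then typically the mean of these per-question values over the validation set. EM rewards only fully correct answers, while F1 gives partial credit when the generated answer contains extra or missing tokens.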

Paper Structure

This paper contains 12 sections, 7 figures.

Figures (7)

  • Figure 1: RAG fine-tuning strategy subprocesses. Each of the RAG fine-tuning strategies discussed in this paper uses a combination of these subprocesses. Key: Question, Context, Answer, Embedding model, Generator model.
  • Figure 2: Validation performance metrics and time to fine-tune for different fine-tuning strategies, averaged across all four RAG pipelines and both HotPotQA and PopQA datasets.
  • Figure 3: HotPotQA and PopQA validation performance metrics after fine-tuning and time to fine-tune for different fine-tuning strategies, averaged across all four RAG pipelines.
  • Figure 4: Validation loss convergence plot for fine-tuning a RAG pipeline consisting of a MiniLM embedding model and LLaMA-3-8b generator model on HotPotQA with joint fine-tuning. The validation loss converges quickly during fine-tuning, well within the 1 epoch fine-tuning period.
  • Figure 5: Number of parameters in each model used in this paper.
  • ...and 2 more figures