A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri

TL;DR

Contrary to popular belief, this study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under the study's experimental settings, challenging the prevailing assumption that instructed LLMs are superior in RAG applications.

Abstract

Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence, combining a retrieval phase with a generative phase, the latter typically powered by large language models (LLMs). Current common practice in RAG is to use "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".

Paper Structure

This paper contains 35 sections, 2 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Base vs. Instruct + Template under Task Instruction I on TriviaQA. The figure presents a comparison between the responses generated by two versions of the Llama 2 7B model: the base version and the instruct + template version. Each version is tasked with answering the same question based on the provided documents. The base model correctly identifies the answer as "Burgess Meredith", whereas the instruct + template version incorrectly attributes the answer to "Danny DeVito". Italic text denotes the template.
  • Figure 2: Recalling from Parametric Memory - Llama 2 7B - TriviaQA. Reported is the rate of recall from parametric memory, defined as the number of instances where the model answers correctly despite the retrieved documents not containing the correct answer, divided by the number of instances where the answer is not present in the context. (left) Task Instruction I, as shown in Figure \ref{fig:base_vs_instruct}; (right) No Rejection setting, where the prompt does not tell the model to answer with NO-RES when the answer is not contained in the retrieved documents (example in Figure \ref{fig:base_vs_instruct_no_rej}). In the No Rejection setting, the parametric memory rate increases for both model versions.
  • Figure 3: Negative Rejection Rate - Llama 2 7B - TriviaQA. Reported is the negative rejection rate, that is, the number of times the model answers NO-RES when the correct answer is not in the context, divided by the number of times the answer is indeed missing. Instruct models are much more effective at detecting such cases and following the instructions provided.
  • Figure 4: Base vs. Instruct + Template under Task Instruction II on NQ. This comparison of responses between the base and instruct + template versions of Mistral 7B illustrates an example where the base model correctly identifies the answer, while the instruct + template version erroneously opts for a NO-RES response, despite the correct answer being present in the documents. Italic text denotes the template.
  • Figure 5: Base vs. Instruct + Template under Task Instruction II on TriviaQA. This comparison of responses between the base and instruct + template versions of Llama 2 7B illustrates an example where the base model correctly identifies the answer, while the instruct + template version inaccurately attributes the answer to a different actor. Nevertheless, in both cases, the answers are "coherent" with the Proof, since each piece of evidence contains the generated answer. Italic text denotes the template.
  • ...and 4 more figures
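
The two rates defined in the captions of Figures 2 and 3 share the same denominator: the instances where the retrieved documents do not contain the correct answer. A minimal sketch of how they could be computed follows. The `Example` record, its field names, and the exact string match against NO-RES are assumptions made for illustration; the paper's own evaluation code may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

NO_RES = "NO-RES"  # rejection token named in the figure captions


@dataclass
class Example:
    """Hypothetical per-question record; field names are illustrative only."""
    answer_in_context: bool  # True if the retrieved documents contain the gold answer
    response: str            # text generated by the model
    is_correct: bool         # True if the response matches the gold answer


def _answer_missing(examples: List[Example]) -> List[Example]:
    # Shared denominator: instances whose context lacks the correct answer.
    return [e for e in examples if not e.answer_in_context]


def parametric_recall_rate(examples: List[Example]) -> Optional[float]:
    """Correct answers given without context support / answer-missing instances."""
    missing = _answer_missing(examples)
    if not missing:
        return None  # rate is undefined when the answer is always present
    return sum(e.is_correct for e in missing) / len(missing)


def negative_rejection_rate(examples: List[Example]) -> Optional[float]:
    """NO-RES responses on answer-missing instances / answer-missing instances."""
    missing = _answer_missing(examples)
    if not missing:
        return None
    return sum(e.response.strip() == NO_RES for e in missing) / len(missing)
```

Under these assumptions, a response counts as a rejection only if it is exactly NO-RES after trimming whitespace; a looser match (for example, a substring check) would yield higher rejection rates for models that wrap the token in extra text.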