Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Simone Alghisi; Massimo Rizzoli; Gabriel Roccabruna; Seyed Mahed Mousavi; Giuseppe Riccardi

TL;DR

This work systematically compares in-context learning and LoRA-based fine-tuning for adapting LLMs to four dialogue types (Open-Domain, Knowledge-Grounded, Task-Oriented, and QA) and examines grounding through Retrieval-Augmented Generation versus gold documents. Using two base models (Llama2$_{C}$ and Mistral$_{I}$) and consistent automatic and human evaluations, the study shows that no single adaptation technique dominates across all settings; outcomes depend on both the base model and the dialogue type. An explainability analysis using integrated gradients and a thorough human evaluation reveal that automatic metrics can be unreliable and that grounding effectiveness hinges on retriever quality and knowledge representation. The findings underscore the importance of human judgments in evaluating dialogue systems and offer practical guidance for selecting adaptation strategies in real-world deployments, while highlighting the need for larger models and more robust evaluation protocols in future work.

Abstract

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in two scenarios: Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universally best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Finally, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.

Paper Structure (19 sections, 5 figures, 9 tables)

Introduction
Literature Review
Experiments
Datasets
Techniques
Knowledge
Models
Evaluation
Automatic Evaluation
Explainability Study
Human Evaluation
Explaining Negative Human Judgments
Conclusion
Appendix
Datasets
...and 4 more sections

Figures (5)

Figure 1: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2$_{C}$ and Mistral$_{I}$, adapted with In-Context Learning and Fine-Tuning in Open-Domain Dialogues (ODDs).
Figure 2: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2$_{C}$ and Mistral$_{I}$, adapted with In-Context Learning and Fine-Tuning in Knowledge-Grounded Dialogues (KGDs).
Figure 3: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, Incoherent, and Unhelpful) (x-axis), for Llama2$_{C}$ and Mistral$_{I}$, adapted with In-Context Learning and Fine-Tuning in Task-Oriented Dialogues (TODs).
Figure 4: Percentage of LLM responses (y-axis) for each error type (Not Contextualized) and their explanation (Generic and Hallucinated) (x-axis), for Llama2$_{C}$ and Mistral$_{I}$, adapted with In-Context Learning and Fine-Tuning in Question Answering (QA).
Figure 5: Performance of the off-the-shelf retriever for each dialogue type. The retriever achieves the lowest Recall@K on TOD because of the larger knowledge base size (2900 documents). However, the retriever achieves a higher Recall@K for QA, even though its knowledge base is bigger than the one for KGD (355 vs. 61 $\pm$ 21). Further studies indicate that, although the retriever is not able to retrieve the exact sentence selected by the annotator (KGD Sentence), it selects a sentence belonging to the same paragraph more than 69% of the time (KGD Paragraph).
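
The Recall@K metric named in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the common definition (the fraction of gold items found among the top K retrieved items, averaged over queries); the paper's exact definition is not given on this page, and the function names and toy document IDs below are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold items that appear in the top-k of one ranked list."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    top_k = set(ranked[:k])
    return len(top_k & gold) / len(gold)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall_at_k over (ranked list, gold set) pairs, one per query."""
    scores = [recall_at_k(ranked, gold, k) for ranked, gold in runs]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): each query has a single gold document.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # gold found at rank 2
    (["d2", "d9", "d4"], {"d4"}),  # gold found at rank 3
]

for k in (1, 2, 3):
    print(f"Recall@{k} = {mean_recall_at_k(runs, k):.2f}")
```

With a single gold item per query, as in the toy data, this reduces to the share of queries whose gold item appears in the top K. Applying the same function with paragraph IDs instead of sentence IDs would give a coarser score in the spirit of the caption's KGD Sentence versus KGD Paragraph comparison.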
