Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
Simone Alghisi, Massimo Rizzoli, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
TL;DR
This work systematically compares in-context learning and LoRA-based fine-tuning for adapting LLMs to four dialogue types (Open-Domain, Knowledge-Grounded, Task-Oriented, and QA) and examines grounding through Retrieval-Augmented Generation versus gold documents. Using two base models (Llama2_C and Mistral_I) and consistent automatic and human evaluations, the study shows that no single adaptation technique dominates across all settings; outcomes depend on both the base model and the dialogue type. An explainability analysis using integrated gradients and a thorough human evaluation reveal that automatic metrics can be unreliable and that grounding effectiveness hinges on retriever quality and knowledge representation. The findings underscore the importance of human judgments in evaluating dialogue systems and offer practical guidance for selecting adaptation strategies in real-world deployments, while highlighting the need for larger models and more robust evaluation protocols in future work.
Abstract
We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in two scenarios: Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.
