Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems

Robert Lakatos; Peter Pollner; Andras Hajdu; Tamas Joo

TL;DR

This work addresses domain adaptation for generative LLM knowledge systems by systematically comparing Fine-Tuning (FN) and Retrieval-Augmented Generation (RAG) across multiple models (GPT-J-6B, OPT-6.7B, LLaMA, LLaMA-2). Using the CORN, UB, and COVID-CORD datasets and the BLEU, ROUGE, METEOR, and cosine similarity (CS) metrics, it shows that RAG-based architectures outperform both FN models and baselines and hallucinate markedly less (an average cosine similarity gain of about 53% over FN), although FN models score about 8% higher on METEOR on average. A simple RAG architecture that leverages ID_s-style indexing for context retrieval yields the best overall results (ROUGE ~0.30, METEOR ~0.22, BLEU ~0.063, CS ~0.57). The study concludes that RAG offers a practical, scalable path for AI-driven knowledge systems, requiring less retraining and allowing easier expansion, while combining FN with RAG provides limited added value.

Abstract

The development of generative large language models (G-LLM) has opened up new opportunities for building new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are two techniques that can be used to implement domain adaptation for G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, METEOR scores, and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LLaMA, and LLaMA-2 language models. Based on measurements on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROUGE score, 15% in terms of the BLEU score, and 53% in terms of cosine similarity. This shows the significant advantage of RAG over FN in terms of hallucination, which is not offset by the fact that the average 8% better METEOR score of FN models indicates greater creativity compared to RAG.
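To make the retrieval step and the cosine similarity measure more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy bag-of-words vector in place of a real sentence embedding, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for illustration. It ranks stored passages by cosine similarity to a question and places the best match in the prompt that would be passed to a language model.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def retrieve(question, passages, top_k=1):
    """Return the top_k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(q, embed(p)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, passages, top_k=1):
    """Prepend the retrieved context to the question (hypothetical prompt layout)."""
    context = "\n".join(retrieve(question, passages, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "corn is planted in spring",
        "the university library opens at nine",
        "covid vaccines reduce severe illness",
    ]
    print(build_prompt("when is corn planted", docs))
```

The same `cosine_similarity` function could also compare a generated answer with a reference answer, which is the role the metric plays in the evaluation described above; in practice both texts would be encoded with a learned embedding model rather than word counts.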
