Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems

Robert Lakatos, Peter Pollner, Andras Hajdu, Tamas Joo

TL;DR

This work addresses domain adaptation for generative LLM knowledge systems by systematically comparing Fine-Tuning (FN) and Retrieval-Augmented Generation (RAG) across multiple models (GPT-J-6B, OPT-6.7B, LLaMA, LLaMA-2). Using the CORN, UB, and COVID-CORD datasets and the BLEU, ROUGE, METEOR, and cosine similarity metrics, it shows that RAG-based architectures outperform both FN models and baselines and substantially reduce hallucination (cosine similarity about 53% higher on average), even though FN models sometimes achieve higher METEOR scores. A simple RAG architecture that leverages ID_s-style indexing for context retrieval yields the best overall results (ROUGE ~0.30, METEOR ~0.22, BLEU ~0.063, CS ~0.57). The study concludes that RAG offers a practical, scalable path for AI-driven knowledge systems, requiring less retraining and allowing easier expansion, while combining FN and RAG provides limited added value.

Abstract

The development of generative large language models (G-LLM) has opened up new opportunities for building new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are techniques that can be used to implement domain adaptation for G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, and METEOR scores and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LLaMA, and LLaMA-2 language models. Based on measurements on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROUGE score, 15% in terms of the BLEU score, and 53% in terms of cosine similarity. This shows the significant advantage of RAG over FN in terms of hallucination, which is not offset by the fact that the FN models' METEOR score, 8% better on average, indicates greater creativity compared to RAG.

Paper Structure

This paper contains 13 sections, 3 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Radar plot of the evaluation results of the models.
  • Figure 2: Flow diagram of the RAG model (best approach) that uses a search engine based on the vectorial embedding of sentences.
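
The retrieval step named in Figure 2 can be illustrated with a minimal sketch: sentences are embedded as vectors, ranked by cosine similarity to the question, and the best matches are placed in the prompt as context. This is not the authors' implementation. The bag-of-words `embed` function is a toy stand-in for a learned sentence-embedding model, and the function names, the `top_k` parameter, and the prompt template are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real system would call a learned sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, sentences: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k sentences most similar to the question, best first."""
    query = embed(question)
    scored = [(cosine_similarity(query, embed(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop sentences that share nothing with the question.
    return [s for score, s in scored[:top_k] if score > 0.0]


def build_prompt(question: str, context: list[str]) -> str:
    """Hypothetical prompt template: retrieved context first, then the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "Corn is planted in spring.",
        "The university library opens at eight.",
        "Coronavirus spreads through droplets.",
    ]
    question = "When is corn planted?"
    print(build_prompt(question, retrieve(question, corpus, top_k=1)))
```

The resulting prompt would then be passed to the language model. Because the knowledge lives in the indexed sentences rather than in the model weights, extending the system amounts to adding sentences to the corpus, which matches the summary's point that RAG needs less retraining than FN.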