Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
TL;DR
The paper examines how retrieval-augmented generation (RAG) affects QA quality on computer science literature by benchmarking four open-source LLMs and GPT-3.5 on a carefully constructed abstract-level dataset. It deploys a LangChain-based RAG pipeline with SPECTER embeddings and FAISS, and evaluates binary and long-form questions using accuracy, precision, cosine similarity, and human and Gemini rankings, along with latency and cost. GPT-3.5 with RAG delivers the strongest binary QA performance, Mistral-7b-instruct with RAG leads among the open-source models, and latency varies considerably across platforms. The work highlights practical deployment considerations and outlines directions for extending the study to full texts, alternative retrieval strategies, and broader domains.
Abstract
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. The increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) across diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's GPT-3.5 on QA tasks within the computer science literature, all with RAG support. The evaluation metrics are accuracy and precision for binary questions and, for long-answer questions, ranking by a human expert, ranking by Google's AI model Gemini, and cosine similarity. GPT-3.5, when paired with RAG, effectively answers both binary and long-answer questions, reaffirming its status as an advanced LLM. Among the open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, Orca-mini-v3-7b reports the lowest average latency in generating responses among the open-source LLMs, whereas Meta's LLaMa2-7b-chat reports the highest. This research underscores that open-source LLMs, given better infrastructure, can keep pace with proprietary models like GPT-3.5.
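The metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the "yes"/"no" label encoding, and the assumption that long answers have already been turned into numeric vectors by some embedding model are all hypothetical choices made for illustration.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of binary answers that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive="yes"):
    """Of the answers predicted as `positive`, the fraction that are truly positive."""
    # Assumed label encoding: "yes" is the positive class.
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_positive:
        return 0.0  # convention chosen here: no positive predictions gives 0.0
    true_positive = sum(t == positive for t in predicted_positive)
    return true_positive / len(predicted_positive)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector has no direction
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = ["yes", "no", "yes", "no"]
    predicted = ["yes", "yes", "yes", "no"]
    print("accuracy:", accuracy(reference, predicted))
    print("precision:", precision(reference, predicted))
    print("cosine:", cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

In a setup like this, binary answers would be scored with `accuracy` and `precision` against gold labels, while each long answer would be embedded and compared with its reference answer's embedding through `cosine_similarity`.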
