Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature

Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda

TL;DR

The paper examines how retrieval-augmented generation (RAG) affects QA quality on computer science literature by benchmarking four open-source LLMs and GPT-3.5 on a purpose-built dataset of paper abstracts. It uses a LangChain-based RAG pipeline with SPECTER embeddings and FAISS, and evaluates binary questions with accuracy and precision and long-form questions with cosine similarity and human/Gemini rankings, alongside latency and cost. GPT-3.5 with RAG delivers the strongest binary QA performance, Mistral-7b-instruct with RAG leads among the open-source models, and latency varies considerably across platforms. The work highlights practical deployment considerations and outlines directions for extending the study to full texts, alternative retrieval strategies, and broader domains.

Abstract

Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. The increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) across diverse domains. This study compares the performance of four open-source LLMs (Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b) and OpenAI's GPT-3.5 on QA tasks within the computer science literature, with RAG support. The evaluation metrics are accuracy and precision for binary questions and, for long-answer questions, ranking by a human expert, ranking by Google's AI model Gemini, and cosine similarity. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Among the open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, Orca-mini-v3-7b reports the shortest average latency in generating responses among the open-source LLMs, whereas LLaMa2-7b-chat by Meta reports the highest. This research suggests that open-source LLMs, given better infrastructure, can keep pace with proprietary models like GPT-3.5.
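To make the evaluation metrics named in the abstract concrete, the following is a minimal sketch of accuracy and precision for binary (yes/no) answers and of cosine similarity between two embedding vectors. It is not the authors' evaluation code: the function names, the choice of "yes" as the positive class, and the toy inputs are all assumptions made for illustration.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of binary answers that match the reference answers."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive="yes"):
    """Of the answers predicted as `positive`, the fraction that are truly positive.

    Treating "yes" as the positive class is an assumption of this sketch.
    """
    refs_for_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not refs_for_predicted_pos:
        return 0.0  # no positive predictions, so precision is reported as 0 here
    hits = sum(t == positive for t in refs_for_predicted_pos)
    return hits / len(refs_for_predicted_pos)

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity with a zero vector is defined as 0 in this sketch
    return dot / (norm_u * norm_v)

# Toy data (hypothetical, not from the paper).
reference = ["yes", "no", "yes", "yes"]
predicted = ["yes", "yes", "no", "yes"]
print(accuracy(reference, predicted))
print(precision(reference, predicted))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

For long-answer questions, the vectors passed to `cosine_similarity` would be embeddings of the generated answer and of a reference answer; which embedding model is used for that step is not stated in this summary.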

Paper Structure

This paper contains 10 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Conceptual framework
  • Figure 2: Conversion process of the abstracts
  • Figure 3: Overview of SPECTER vectorisation
  • Figure 4: Architecture of QA pipeline
  • Figure 5: Prompt designed for GPT-3.5 to query about quantum computing
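The vectorisation and QA pipeline referred to in Figures 3 and 4 follows the usual RAG pattern: embed the abstracts, retrieve those closest to the question, and place them in the prompt. The sketch below illustrates that pattern in plain Python. It is an assumption-laden stand-in rather than the paper's pipeline: the hand-written three-dimensional vectors take the place of SPECTER embeddings, the brute-force cosine search takes the place of a FAISS index, and the prompt wording and the names `top_k` and `build_prompt` are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query.

    Brute-force cosine search; a vector index would serve this role in a real pipeline.
    """
    q = normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = normalize(doc)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, idx))
    # Highest score first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

def build_prompt(question, contexts):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Toy corpus and embeddings (hypothetical values, not SPECTER output).
abstracts = [
    "Abstract about quantum error correction.",
    "Abstract about database indexing.",
    "Abstract about quantum algorithms.",
]
abstract_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
question_vec = [1.0, 0.0, 0.0]

hits = top_k(question_vec, abstract_vecs, k=2)
prompt = build_prompt("Is quantum computing discussed?", [abstracts[i] for i in hits])
print(hits)
print(prompt)
```

The resulting prompt string would then be sent to whichever LLM is under evaluation, which keeps the retrieval step identical across models so that differences in answers reflect the models themselves.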