Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

Jiarui Li; Ye Yuan; Zehua Zhang

TL;DR

The paper tackles LLM hallucinations in domain-specific Q&A by integrating a Retrieval Augmented Generation (RAG) pipeline with a CMU/LTI-focused external dataset. It introduces an end-to-end system, including web data collection, automated QA annotation, embedding and reranking, and core generation with LLaMA-2, coupled with thorough ablations and case studies. Key contributions are the CMU/LTI dataset construction, a state-of-the-art RAG pipeline tailored to knowledge-intensive tasks, and a rigorous evaluation showing improved factual accuracy while highlighting limitations of small, biased datasets. The work demonstrates the practical potential of external data augmentation for domain-specific QA and provides a reproducible framework for future knowledge-intensive NLP systems.

Abstract

We propose an end-to-end system design that uses Retrieval Augmented Generation (RAG) to improve the factual accuracy of Large Language Models (LLMs) on domain-specific and time-sensitive queries over private knowledge bases. Our system integrates a RAG pipeline with upstream dataset processing and downstream performance evaluation. To address the challenge of LLM hallucinations, we fine-tune models on a curated dataset that originates from CMU's extensive resources and is annotated with a teacher model. Our experiments demonstrate the system's effectiveness in generating more accurate answers to domain-specific and time-sensitive inquiries. The results also reveal the limitations of fine-tuning LLMs on small-scale, skewed datasets. This research highlights the potential of RAG systems for augmenting LLMs with external datasets to improve performance on knowledge-intensive tasks. Our code and models are available on GitHub.

Paper Structure (28 sections, 7 equations, 4 figures, 1 table)

Introduction
System Overview
Dataset Creation
Web Crawler
Data Crawling
Data Organization and Post-processing
Research Papers
Annotation Automation
Dataset Evaluation
Question-Answering Pipeline
Embedding Model
Reranking Model
Core Model
Experiments
Setup
...and 13 more sections

Figures (4)

Figure 1: Overview of our system design. The process begins with a Web Crawler, which consists of a Recursive Crawler and a Filter that gather raw data. The raw data is sanitized and stored in an S3 storage bucket. The Dataset Generator relies on an Annotation Model to produce annotations and filter datapoints, generating fine-tuning datasets for all models in the system, i.e. the Annotation, Embedding, Reranking, and Core Models. For query processing, the Context Retriever embeds the user question and retrieves relevant contexts, which are re-ranked and rewritten if necessary. The Generation module then uses the Core Model and a Prompt Template to generate a system answer through a QA Chain, so that the answer draws on both the retrieved information and the model's generative capabilities.
Figure 2: The hierarchical knowledge-base file system preserves the structural relations between the original files, enriching the semantic information provided to the retriever.
Figure 3: Overview of the RAG QA pipeline, which is divided into a retrieval phase and a generation phase. In the retrieval phase, the retriever fetches the top 5 reference chunks ranked by MMR score; these are then sent to the reranking model, which prioritizes the information most relevant to the user's inquiry. In the generation phase, the generative model takes the rewritten prompt as input and completes the answer.
Figure 4: Recall, F1 score, cosine similarity, and BLEU on our local test question-answer dataset under different settings. Note that the data in the chart is normalized to the range 0 to 1 for better visibility. For the original experiment output, please refer to Table 1.
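
The retrieval step named in the Figure 3 caption (top 5 chunks by MMR score) can be illustrated with a small sketch. It is not the authors' implementation: it assumes "MMR" refers to maximal marginal relevance, uses cosine similarity over plain Python lists in place of a real embedding model, and the names `cosine`, `mmr_select`, and the trade-off weight `lam` are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, chunk_vecs, k=5, lam=0.5):
    """Greedily pick up to k chunk indices by maximal marginal relevance.

    Each step takes the candidate that maximizes
    lam * sim(query, chunk) - (1 - lam) * max sim(chunk, already selected).
    k=5 mirrors the caption; lam=0.5 is an assumed default, not a value from the paper.
    """
    relevance = [cosine(query_vec, c) for c in chunk_vecs]
    selected = []
    candidates = list(range(len(chunk_vecs)))

    def score(i):
        # Redundancy is the highest similarity to any chunk chosen so far.
        redundancy = max(
            (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings (made up for this example): chunks 0 and 1 are exact duplicates.
query = [1.0, 1.0, 0.0]
chunks = [
    [1.0, 0.5, 0.0],
    [1.0, 0.5, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
picked = mmr_select(query, chunks, k=2)
```

With this toy data, ranking by relevance alone would return the two duplicate chunks, whereas the MMR criterion skips the duplicate in favor of a chunk that adds new information. In a pipeline like the one in the caption, the selected chunks would then be handed to the reranking model.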
