RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots

Philip Feldman, James R. Foulds, Shimei Pan

TL;DR

The paper investigates hallucinations in large language models and evaluates Retrieval-Augmented Generation (RAG) as a remedy by injecting external context via retrieved documents. Using a human-in-the-loop CV-based evaluation with prompts that include or exclude context, the study shows context-enabled RAG can boost accuracy to about 94% compared to 7.3% without context, yet about 6% of responses remain incorrect due to factors like noisy context, misalignment, and formatting issues. The authors categorize errors into five types, revealing that even accurate context can produce credible but wrong outputs, highlighting limitations in current RAG approaches. The findings underscore the importance of context quality, prompt design, and user understanding for deploying trustworthy, context-aware LLM systems in real-world applications.

Abstract

Large language models (LLMs) like ChatGPT demonstrate the remarkable progress of artificial intelligence. However, their tendency to hallucinate -- generate plausible but false information -- poses a significant challenge. This issue is critical, as seen in recent court cases where ChatGPT's use led to citations of non-existent legal rulings. This paper explores how Retrieval-Augmented Generation (RAG) can counter hallucinations by integrating external knowledge with prompts. We empirically evaluate RAG against standard LLMs using prompts designed to induce hallucinations. Our results show that RAG increases accuracy in some cases, but can still be misled when prompts directly contradict the model's pre-trained understanding. These findings highlight the complex nature of hallucinations and the need for more robust solutions to ensure LLM reliability in real-world applications. We offer practical recommendations for RAG deployment and discuss implications for the development of more trustworthy LLMs.
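As a rough illustration of what "integrating external knowledge with prompts" can look like, the sketch below builds the same question in two forms, one with retrieved context and one without. The function name `build_prompt`, the prompt wording, and the placeholder strings are assumptions made for illustration; they are not the prompts used in the paper.

```python
from typing import Optional


def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Build a prompt with or without retrieved context.

    Hypothetical helper: the template wording is an assumption,
    not the paper's actual prompt.
    """
    if context is None:
        # Baseline condition: the model sees only the question.
        return f"Question: {question}\nAnswer:"
    # Context condition: retrieved text is placed ahead of the question.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Placeholder inputs, purely illustrative.
    question = "<question about the source document>"
    retrieved = "<passage returned by a retriever>"
    print(build_prompt(question))
    print("---")
    print(build_prompt(question, retrieved))
```

Sending both variants to the same model and labeling the answers by hand is one way to compare the two conditions; the retrieval step itself is deliberately left out of this sketch.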

Paper Structure (11 sections, 1 figure, 1 table)