Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots

Maria Paola Priola

TL;DR

The paper tackles hallucinations in healthcare-focused LLMs by integrating retrieval-augmented generation (RAG) with a context-aware evaluation framework (NMISS). NMISS blends reference overlap with contextual grounding to distinguish genuine hallucinations from contextually valid elaborations in a lightweight, black-box fashion. Through experiments on a large Italian health-news corpus and multiple models (Gemma, Gemma-2, GPT-4, LLaMA-2/3, Mistral), the study demonstrates that RAG improves factual grounding while NMISS reveals nuanced behavior not captured by traditional metrics, especially for mid-tier models. The work contributes a new Italian healthcare QA dataset and demonstrates practical implications for multilingual and domain-specific applications, with future work extending NMISS to multi-hop and adaptive fine-tuning contexts.

Abstract

I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework, while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.

Paper Structure

This paper contains 21 sections, 23 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: News Frequency Aggregated by Year
  • Figure 2: User-Agent Application Workflow
  • Figure 3: ROUGE Metrics by Question Levels
  • Figure 4: BLEU, METEOR and EXACT MATCH Metrics by Question Levels
  • Figure 5: NMISS Utility vs. Hallucination Rate (BLEU)
  • ...and 2 more figures