THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models

Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven

TL;DR

THaMES delivers an end-to-end toolkit for hallucination evaluation and mitigation in LLMs by integrating automated test set generation from arbitrary corpora, multifaceted benchmarking, and diverse mitigation strategies. It constructs a 2,100-question QA benchmark with both correct and hallucinated answers, and ranks outputs by an Ensemble Score, defined as $\text{Ensemble Score} = \text{Entailment Score} + \text{Factual Consistency Score}$, to quantify hallucination severity. Test set construction employs a weighted sampling scheme with $p_i = w_i / \sum_j w_j$ and $w_i = 1/(c_i+1)$. The framework evaluates hallucination detection and generation across models using metrics from RAGAS and a hallucination identification accuracy test, then compares In-Context Learning (CoVe), RAG, and LoRA-based PEFT on GPT-4o, GPT-4o-mini, Llama-3.1-8B-Instruct, and Mistral-Nemo. Empirical results show that the benefits are model-dependent: RAG significantly aids GPT-4o, ICL aids Llama-3.1-8B-Instruct, and PEFT yields notable improvements for Llama-3.1-8B-Instruct. This positions THaMES as a standardized, domain-flexible tool for safer LLM deployment and development.
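
The two formulas above can be illustrated with a minimal Python sketch. It is not the THaMES implementation: the function names are hypothetical, and it assumes that $c_i$ is the number of times item $i$ has already been selected, which this summary does not state explicitly.

```python
import random
from typing import List, Sequence


def selection_probabilities(counts: Sequence[int]) -> List[float]:
    """Return p_i = w_i / sum_j w_j, with w_i = 1 / (c_i + 1).

    Assumption: counts[i] (c_i) is how often item i was already selected.
    """
    weights = [1.0 / (c + 1) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_indices(num_items: int, num_draws: int, seed: int = 0) -> List[int]:
    """Draw indices one at a time, updating the counts after every draw."""
    rng = random.Random(seed)
    counts = [0] * num_items
    draws = []
    for _ in range(num_draws):
        probs = selection_probabilities(counts)
        idx = rng.choices(range(num_items), weights=probs, k=1)[0]
        counts[idx] += 1  # a selected item becomes less likely next time
        draws.append(idx)
    return draws


def ensemble_score(entailment: float, factual_consistency: float) -> float:
    """Ensemble Score = Entailment Score + Factual Consistency Score."""
    return entailment + factual_consistency


if __name__ == "__main__":
    # Items selected 0, 1, and 3 times get weights 1, 1/2, and 1/4.
    print(selection_probabilities([0, 1, 3]))
    print(sample_indices(num_items=5, num_draws=20))
    print(ensemble_score(0.5, 0.25))
```

Under this reading, items that have been selected less often receive a higher probability, which nudges repeated sampling toward more even coverage of the corpus.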

Abstract

Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
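
The binary classification task mentioned above can be pictured as an accuracy computation over labeled answers. The sketch below is an assumption-laden illustration, not the paper's evaluation code: the function name and the `(gold, predicted)` pair format are hypothetical.

```python
from typing import List, Tuple


def identification_accuracy(examples: List[Tuple[bool, bool]]) -> float:
    """Fraction of answers whose hallucination label was predicted correctly.

    Each example is (gold_is_hallucinated, predicted_is_hallucinated).
    """
    if not examples:
        raise ValueError("at least one example is required")
    correct = sum(1 for gold, predicted in examples if gold == predicted)
    return correct / len(examples)


if __name__ == "__main__":
    # Three of the four judgments match the gold label.
    judgments = [(True, True), (False, True), (False, False), (True, True)]
    print(identification_accuracy(judgments))
```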

Paper Structure

This paper contains 37 sections, 4 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: System Diagram of the THaMES Framework, including QA set generation, hallucination benchmarking, and mitigation strategies.
  • Figure 2: Box Plot showing Uniform Effectiveness of Different Sampling Methods
  • Figure 3: Retrieval Counts by Node ID