Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

Liam Barkley, Brink van der Merwe

TL;DR

LLM agents can exhibit significantly higher hallucination rates due to the added complexity of external tool usage, and simpler prompting techniques often outperform more complex methods in reducing hallucinations.

Abstract

Large Language Models (LLMs) are powerful computational models trained on extensive corpora of human-readable text, enabling them to perform general-purpose language understanding and generation. LLMs have garnered significant attention in both industry and academia due to their exceptional performance across various natural language processing (NLP) tasks. Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations. Prompt engineering, the process of designing and formulating instructions for LLMs to perform specific tasks, has emerged as a key approach to mitigating hallucinations. This paper provides a comprehensive empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs. Various prompting techniques are applied to a broad set of benchmark datasets to assess the accuracy and hallucination rate of each method. Additionally, the paper investigates the influence of tool-calling agents (LLMs augmented with external tools to enhance their capabilities beyond language generation) on hallucination rates in the same benchmarks. The findings demonstrate that the optimal prompting technique depends on the type of problem, and that simpler techniques often outperform more complex methods in reducing hallucinations. Furthermore, it is shown that LLM agents can exhibit significantly higher hallucination rates due to the added complexity of external tool usage.

Paper Structure

This paper contains 21 sections, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Average frequency over the number of correctly sampled responses per question for the and - strategies over the benchmark.
  • Figure 2: Average Top-1 to Top-5 accuracy for the and - approaches on the benchmark.
  • Figure 3: Average frequency over the number of correctly sampled responses per question for the approach over the TriviaQA benchmark.
  • Figure 4: Average Top-1 to Top-5 accuracy for the strategy on the TriviaQA benchmark.
  • Figure 5: Average accuracy per subject for the and strategies on the dataset.
  • ...and 14 more figures