Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

Neeraj Varshney; Satyam Raj; Venkatesh Mishra; Agneet Chatterjee; Ritika Sarkar; Amir Saeidi; Chitta Baral

TL;DR

The paper studies LLM hallucinations that arise specifically from negation, using four negation-focused evaluation tasks: False Premise Completion (FPC), Constrained Fact Generation (CFG), Multiple-Choice QA (MCQA), and Fact Generation (FG). It benchmarks open-source 13B models (LLaMA-2-chat, Vicuna, Orca-2) and finds substantial hallucination on false-premise prompts; on FG, hallucinations increase when the prompt contains negation. The authors evaluate four mitigation strategies (cautionary instructions, in-context exemplars, self-refinement, and knowledge augmentation) with mixed results: combining instructions with exemplars (Inst+Exemp) reduces hallucinations most effectively, whereas knowledge augmentation often increases hallucinations on false-premise prompts while helping on correct-premise prompts. The findings expose a critical weakness in how current LLMs handle negation, highlight the trade-offs among mitigation approaches, and point toward more robust negation-aware generation and multilingual extension.
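To make the instruction-plus-exemplar (Inst+Exemp) idea concrete, the following is a minimal sketch of how such a prompt might be assembled. It is not the authors' implementation: the instruction wording, the two exemplars, the `Exemplar` class, and the `build_prompt` helper are all assumptions written for illustration, and the "Prompt:/Response:" layout is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Exemplar:
    prompt: str    # a prompt that contains negation
    response: str  # the response we would like the model to imitate


# Hypothetical wording; the paper's actual instruction text is not reproduced here.
CAUTIONARY_INSTRUCTION = (
    "Note that the prompt may rest on a false premise. "
    "If it does, point out the error instead of completing it."
)

# Illustrative exemplars written for this sketch: one false premise, one correct premise.
EXEMPLARS = (
    Exemplar(
        prompt="Saturn is not the second largest planet in our solar system because",
        response="The premise is false: Saturn is the second largest planet in our solar system.",
    ),
    Exemplar(
        prompt="Jupiter is not the smallest planet in our solar system because",
        response="it is in fact the largest planet in our solar system.",
    ),
)


def build_prompt(
    query: str,
    instruction: str = CAUTIONARY_INSTRUCTION,
    exemplars: Sequence[Exemplar] = EXEMPLARS,
) -> str:
    """Concatenate the instruction, the exemplars, and the query into one prompt string."""
    parts = []
    if instruction:
        parts.append(instruction)
    for ex in exemplars:
        parts.append(f"Prompt: {ex.prompt}\nResponse: {ex.response}")
    # The final block leaves the response open for the model to complete.
    parts.append(f"Prompt: {query}\nResponse:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("The Sun is not a star because"))
```

Passing `instruction=""` or `exemplars=()` yields the instruction-only, exemplar-only, or bare-prompt variants, which is one simple way to compare the strategies against each other on the same query.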

Abstract

Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation pertinent to 'hallucination' in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization, and dialogue generation. However, the crucial aspect pertaining to 'negation' has remained considerably underexplored. Negation is important because it adds depth and nuance to the understanding of language and is also crucial for logical reasoning and inference. In this work, we address the above limitation and particularly focus on studying the impact of negation in LLM hallucinations. Specifically, we study four tasks with negation: 'false premise completion', 'constrained fact generation', 'multiple choice question answering', and 'fact generation'. We show that open-source state-of-the-art LLMs such as LLaMA-2-chat, Vicuna, and Orca-2 hallucinate considerably on all these tasks involving negation which underlines a critical shortcoming of these models. Addressing this problem, we further study numerous strategies to mitigate these hallucinations and demonstrate their impact.

Paper Structure (37 sections, 1 equation, 4 figures, 18 tables)

Introduction
Related Work
Evaluation Tasks
False Premise Completion (FPC)
Rationale:
Constrained Fact Generation (CFG)
Rationale:
Multiple-Choice QA (MCQA)
Rationale:
Fact Generation (FG)
Rationale:
Experiments and Results
False Premise Completion
Performance Evaluation:
Performance of Models
...and 22 more sections

Figures (4)

Figure 1: Illustration of the four tasks that deal with negation studied in this work. Responses enclosed in red boxes (marked with ✗) are hallucinations while those in green boxes (marked with ✓) are factually correct.
Figure 2: Impact of various mitigation strategies with LLaMA-2 model on the Prompt Completion task. We show performance on both false premise prompts and correct premise prompts.
Figure 3: Performance of models on the FG task with negation (w/ neg) and without negation (w/o neg).
Figure 4: Domain-wise performance of LLaMA-2 on the FG task with negation and without negation.
