What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs' Self-consistency Via Adversarial Nudge

Arka Dutta, Sujan Dutta, Rijul Magu, Soumyajit Datta, Munmun De Choudhury, Ashiqur R. KhudaBukhsh

TL;DR

This work addresses the challenge of factuality hallucinations in large language models under adversarial nudges. It introduces HAUNT, a dynamic, ground-truth-free audit framework that generates self-produced truths and lies, verifies them, and then tests model responses against nudges in closed domains. Across five proprietary model families and two domains (movies and books), the study reveals substantial variation in resilience, with Claude showing strong robustness while Gemini and DeepSeek exhibit notable vulnerabilities. The findings highlight the importance of robust factuality safeguards and motivate future work on broader domain coverage and mitigation strategies for real-world LLM deployments.

Abstract

Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudges. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.
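The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the prompt strings, the `Assertion` type, the `is_hallucination` judge, and the `toy_model` stub are all hypothetical names introduced here for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "model" is any callable mapping a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class Assertion:
    text: str
    label: str  # "truth" or "lie", as assigned by the model in step one


def generate(model: Model, title: str, n: int = 2) -> List[Assertion]:
    """Step one: ask the model for n truths and n lies about one closed-domain item."""
    assertions = []
    for label in ("truth", "lie"):
        for i in range(1, n + 1):
            reply = model(f"GENERATE {label} {i} about '{title}'")
            assertions.append(Assertion(reply.strip(), label))
    return assertions


def verify(model: Model, title: str, assertions: List[Assertion]) -> List[Assertion]:
    """Step two: keep only the lies that the same model also labels as lies."""
    verified = []
    for a in assertions:
        verdict = model(f"VERIFY about '{title}': {a.text}").strip().lower()
        if a.label == "lie" and verdict == "lie":
            verified.append(a)
    return verified


def nudge(model: Model, title: str, lies: List[Assertion],
          is_hallucination: Callable[[str], bool]) -> float:
    """Step three: present each self-verified lie as fact; return the hallucination rate."""
    if not lies:
        return 0.0
    hits = 0
    for lie in lies:
        reply = model(f"NUDGE about '{title}': tell me more about the part where {lie.text}")
        hits += int(is_hallucination(reply))
    return hits / len(lies)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM that always plays along with a nudge."""
    if prompt.startswith("GENERATE truth"):
        return "the hero survives"
    if prompt.startswith("GENERATE lie"):
        return "the hero meets a dragon"
    if prompt.startswith("VERIFY"):
        return "lie" if "dragon" in prompt else "truth"
    return "Yes, that part is memorable."


assertions = generate(toy_model, "Some Movie")
lies = verify(toy_model, "Some Movie", assertions)
rate = nudge(toy_model, "Some Movie", lies, lambda r: r.startswith("Yes"))
```

In this sketch the judge is a plain function so the example stays self-contained; any stricter evaluator could be passed in its place.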

Paper Structure

This paper contains 33 sections, 19 figures, 22 tables.

Figures (19)

  • Figure 1: An illustrative conversational sketch between a user and GPT showing GPT's susceptibility to factuality hallucinations in the presence of nudge.
  • Figure 2: Experimental setup for step three: nudging LLM with lies drawn from $\mathcal{T}\&\mathcal{L}$ datasets. For a given LLM $\mathcal{M}$ and movie $m$ (or book $b$), $lie\_1$ and $lie\_2$ denote the first and second lie generated by $\mathcal{M}$ for $m$ (or $b$) at step one: generating truths and lies.
  • Figure 3: Percentage of I don't know responses.
  • Figure 4: Cohen's $\kappa$ agreement between model verifications on truths and lies across three subsets.
  • Figure 5: Percentage of hallucinations after nudging using the experimental setup described in Figure 2 on $\mathcal{D}_{\textit{movies}}$. LLM responses are evaluated with Mistral-Large-Latest (F1 score of 0.93 on a human-annotated eval set); the appendix contains further details. Self-verified indicates the percentage of lies that are verified as inaccurate by the LLM during step two: verifying truths and lies.
  • ...and 14 more figures
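
Figure 4 reports Cohen's $\kappa$ between model verifications. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-rater formula, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function name and the example labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items on which the two raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label everywhere; treat as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up verification labels from two hypothetical models on four assertions.
model_a = ["truth", "truth", "lie", "lie"]
model_b = ["truth", "lie", "lie", "lie"]
kappa = cohen_kappa(model_a, model_b)
```

Here the raters agree on 3 of 4 items ($p_o = 0.75$) and chance agreement is $p_e = 0.5$, so $\kappa = 0.5$.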