The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances

Yining Wang, Yuquan Wang, Xi Li, Mi Zhang, Geng Hong, Min Yang

TL;DR

The paper studies factual hallucinations in large language models and introduces The Illusionist's Prompt, a black-box adversarial method that applies six linguistic mutation guidelines to rewrite normal prompts into deceptive variants that preserve the original semantics. The rewritten prompts increase semantic entropy during inference and bypass five categories of fact-enhancing defenses, causing more factual errors in both open and commercial LLMs. In experiments on the TruthfulQA benchmark across multiple models, the authors show that the adversarial effect transfers across models, including GPT-4o and Gemini-2.0, and they report ablations, baseline comparisons, and an analysis of adaptive mitigations. The work exposes a significant weakness in current factuality mitigation strategies and motivates defenses that account for linguistic nuance and cross-model transferability, with implications for downstream applications and data integrity.
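
To make the notion of semantic entropy concrete, the following is a minimal sketch of its discrete form: sample several answers to one prompt, group them into clusters of equivalent meaning, and take the entropy of the cluster frequencies. It is not the paper's implementation (the Figure 2 caption states that the authors use the official implementation with LLaMA-2-7B). The clustering step is assumed to have been done already, and the function name and the cluster labels below are hypothetical.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids holds one label per sampled answer; answers with the same
    label are assumed to have been judged semantically equivalent by some
    external step (not shown here).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    # H = -sum_c p(c) * log p(c), with p(c) the fraction of samples in cluster c.
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


# Hypothetical cluster labels for five sampled answers per prompt.
original_prompt_clusters = ["a", "a", "a", "a", "b"]  # answers mostly agree
mutated_prompt_clusters = ["a", "b", "c", "a", "d"]   # answers scatter

print(discrete_semantic_entropy(original_prompt_clusters))
print(discrete_semantic_entropy(mutated_prompt_clusters))
```

In this toy setting the second prompt yields the higher value because its sampled answers spread over more meanings, which is the direction of change the TL;DR attributes to the mutated prompts.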

Abstract

As Large Language Models (LLMs) continue to advance, they are increasingly relied upon as real-time sources of information by non-expert users. To ensure the factuality of the information they provide, much research has focused on mitigating hallucinations in LLM responses, but only in the context of formal user queries, rather than maliciously crafted ones. In this study, we introduce The Illusionist's Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries, challenging the factual accuracy of LLMs against five types of fact-enhancing strategies. Our attack automatically generates highly transferable illusory prompts to induce internal factual errors, all while preserving user intent and semantics. Extensive experiments confirm the effectiveness of our attack in compromising black-box LLMs, including commercial APIs like GPT-4o and Gemini-2.0, even with various defensive mechanisms.

Paper Structure

This paper contains 34 sections, 2 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: An illustration of normal queries and adversarially crafted prompts. Top: the adversarial prompt of a previous attack (yao2023llm). Middle: the normal user query and model response. Bottom: the illusionist's prompt produced by our proposed attack and the model response. The factual errors in model responses are underlined.
  • Figure 2: Comparison of semantic entropy and semantic similarity between the original prompts and three types of mutated prompts. The abbreviations read., form., and conc. refer to linguistic mutations that reduce readability, formality, and concreteness, respectively. The semantic entropy is calculated using the official implementation with LLaMA-2-7B, while semantic similarity is computed as in Sec. 5.1.
  • Figure 3: Results of the factuality evaluation on the MC task. The suffixes -greedy and -nucleus denote greedy search and nucleus sampling decoding, respectively.
  • Figure 4: The guidance template for The Illusionist's Prompt.
  • Figure 5: The evaluation prompt for factual hallucinations.
  • ...and 10 more figures