The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances
Yining Wang, Yuquan Wang, Xi Li, Mi Zhang, Geng Hong, Min Yang
TL;DR
The paper investigates factual hallucinations in large language models by introducing The Illusionist's Prompt, a black-box adversarial method that uses six linguistic mutation guidelines to rewrite normal prompts into deceptive variants that preserve the original semantics. This approach increases semantic entropy during inference and bypasses five categories of fact-enhancing defenses, yielding more factual errors in both open-source and commercial LLMs. Through extensive experiments on the TruthfulQA benchmark across multiple models, the authors demonstrate that the adversarial effect transfers between models, including GPT-4o and Gemini-2.0, and they provide ablations, baseline comparisons, and an analysis of adaptive mitigations. The work exposes a significant vulnerability in current factuality mitigation strategies and motivates future defenses that account for linguistic nuance and cross-model transferability, with implications for downstream applications and data integrity.
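To make the notion of semantic entropy mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that several answers have already been sampled for one prompt and grouped into meaning clusters (for example by an entailment model, which is not shown), and it computes the entropy of the empirical cluster distribution. The function name `semantic_entropy` and the cluster labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    Assumption: each label identifies the meaning cluster of one sampled
    answer; how answers are clustered is outside the scope of this sketch.
    """
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    probs = [count / total for count in counts.values()]
    return sum(-p * math.log(p) for p in probs)


# Hypothetical cluster labels for five sampled answers per prompt.
original_prompt_clusters = ["A", "A", "A", "A", "A"]   # all answers agree
illusory_prompt_clusters = ["A", "B", "A", "C", "B"]   # answers disagree

print(semantic_entropy(original_prompt_clusters))
print(semantic_entropy(illusory_prompt_clusters))
```

Under these assumptions, a prompt whose sampled answers all fall into one meaning cluster has zero entropy, while a prompt whose answers scatter across clusters scores higher. That is the sense in which a rewritten prompt can be said to raise semantic entropy.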
Abstract
As Large Language Models (LLMs) continue to advance, they are increasingly relied upon as real-time sources of information by non-expert users. To ensure the factuality of the information they provide, much research has focused on mitigating hallucinations in LLM responses, but only in the context of formal user queries, rather than maliciously crafted ones. In this study, we introduce The Illusionist's Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries, challenging the factual accuracy of LLMs against five types of fact-enhancing strategies. Our attack automatically generates highly transferable illusory prompts to induce internal factual errors, all while preserving user intent and semantics. Extensive experiments confirm the effectiveness of our attack in compromising black-box LLMs, including commercial APIs like GPT-4o and Gemini-2.0, even when various defensive mechanisms are in place.
