Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang
TL;DR
This work identifies a vulnerability in safety-aligned LLMs: they struggle to generate fallacious reasoning and often leak truthful content when prompted to do so. The authors introduce the Fallacy Failure Attack (FFA), which recasts a malicious query as a request for a fallacious but deceptively realistic procedure. The request bypasses safeguards while eliciting factual, harmful outputs, without fine-tuning or multi-turn interaction. Across five LLMs and multiple datasets, FFA achieves competitive attack performance and elicits more harmful outputs than prior jailbreak methods, and standard defenses offer limited protection, underscoring a pressing safety risk. The study also analyzes how scene/purpose prompts affect efficacy and discusses implications for defenses, self-verification, and hallucination in future LLM development.
Abstract
We find that language models have difficulty generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, they tend to leak the honest counterparts while believing them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits malicious output from an aligned language model. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since LLMs generally consider a fallacious procedure fake and thus harmless, the request helps bypass the safeguard mechanism. Yet the output is factually harmful, since the LLM cannot fabricate a fallacious solution and proposes a truthful one instead. We evaluate our approach on five safety-aligned large language models, comparing against four previous jailbreak methods, and show that it achieves competitive performance with more harmful outputs. We believe the findings could extend beyond model safety to areas such as self-verification and hallucination.
