Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang

TL;DR

This work identifies a vulnerability in safety-aligned LLMs: they struggle to generate fallacious reasoning and often leak truthful content when prompted to do so. The authors introduce the Fallacy Failure Attack (FFA), which recasts a malicious query as a request for a fallacious but deceptively realistic procedure. The request bypasses safeguards while eliciting factual, harmful outputs, without fine-tuning or multi-turn interaction. Across five LLMs and multiple datasets, FFA achieves competitive attack performance and more harmful outputs than prior jailbreak methods, and standard defenses offer limited protection, underscoring a pressing safety risk. The study also analyzes how scene/purpose prompts affect efficacy and discusses implications for defenses, self-verification, and hallucination in future LLM development.

Abstract

We find that language models have difficulty generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak the honest counterparts while believing them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits malicious output from an aligned language model. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since LLMs generally consider a fallacious procedure fake and thus harmless, the query helps bypass the safeguard mechanism. Yet the output is factually harmful, since the LLM cannot fabricate fallacious solutions and instead proposes truthful ones. We evaluate our approach on five safety-aligned large language models, comparing against four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, to areas such as self-verification and hallucination.

Paper Structure (31 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: A prompt containing malicious behavior can be rejected by a human-value aligned language model. However, when asked to generate a fallacious procedure for the malicious behavior, an LLM can leak the honest answer, yet believe it false.
  • Figure 2: Accuracy (measured against ground-truth answers) of fallacious and honest solutions on four different tasks by GPT-3.5-turbo.
  • Figure 3: An example where the LLM failed to provide a fallacious solution upon request but instead proposed the correct solution and contradictorily claimed it false.
  • Figure 4: A comparison between our pilot jailbreak prompts and corresponding output excerpts, with and without specification of deceptiveness.
  • Figure 5: Scatter plot of AHS and ASR from five attack and scene/purpose combinations across three language models.
  • ...and 2 more figures