When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Zhaoxin Zhang, Borui Chen, Yiming Hu, Youyang Qu, Tianqing Zhu, Longxiang Gao

TL;DR

The paper reveals a new vulnerability in LLM safety: outputs can be steered toward extremist ideologies through abstract conceptual manipulation rather than explicit harmful prompts. It introduces Morphology Inspired Conceptual Manipulation (MICM), a model-agnostic jailbreak that places Concept-Embedded Triggers (CETs) in a fixed template to shift the model’s underlying ideological orientation, which is quantified by a five-dimension Ideological Alignment Score (IAS). Across 120 real-world incidents and five advanced LLMs, MICM outperforms state-of-the-art jailbreak baselines, achieving a high attack success rate (ASR) and IAS with minimal rejection, which shows that current safety mechanisms are vulnerable to covert value manipulation. The work advocates cross-disciplinary defenses and enhanced detection of ideological alignment to mitigate such conceptual attacks on commercial LLMs.

Abstract

Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, DeepSeek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Illustration of MICM attack using a set of pre-defined, model-agnostic concept-embedded triggers to manipulate the underlying ideological orientation of LLM-generated content.
  • Figure 2: Illustration of MICM methodology. The green segments represent the original query, while the pink segments indicate content associated with ideological themes.
  • Figure 3: Score distributions of MICM attack results, shown as a KDE plot with bandwidth = 0.5.
  • Figure 4: Cumulative score distributions of MICM attack results, shown as a CDF plot.
  • Figure 5: Comparison of average IAS across models under jailbreak attacks.
  • ...and 2 more figures
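
For readers unfamiliar with the two plot types named in the captions for Figures 3 and 4, the following is a minimal sketch of how a kernel density estimate (KDE) and an empirical cumulative distribution function (CDF) can be computed from a list of scores. It is not the authors' plotting code. The score values are made up for illustration, the function names are hypothetical, and the sketch assumes that "bandwidth=0.5" means the standard deviation of a Gaussian kernel; some plotting libraries instead treat such a parameter as a scaling factor on an automatically chosen bandwidth.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    Assumption: `bandwidth` is the kernel standard deviation.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale so the density integrates to 1.
    return kernel.sum(axis=1) / (samples.size * bandwidth)


def empirical_cdf(samples, grid):
    """Fraction of samples less than or equal to each grid value."""
    ordered = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(ordered, grid, side="right") / ordered.size


# Made-up scores for illustration only; these are not results from the paper.
scores = np.array([1.0, 2.0, 2.5, 3.0, 4.5])
grid = np.linspace(-2.0, 8.0, 1001)

density = gaussian_kde_1d(scores, grid, bandwidth=0.5)
cdf = empirical_cdf(scores, grid)
```

The density curve is a smoothed view of where scores concentrate, while the CDF at a value x gives the share of responses scoring at most x, which makes it easier to compare how quickly different methods accumulate high scores.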