Semantic Representation Attack against Aligned Large Language Models

Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau

TL;DR

This work reframes adversarial attacks on aligned LLMs from exact-output prompts to semantic representations, introducing the Semantic Representation Attack (SRA) and the Semantic Representation Heuristic Search (SRHS) algorithm. By targeting a semantic representation space $\Omega$ and enforcing coherence via perplexity-based constraints, the method generates concise, natural prompts that induce semantically equivalent harmful responses across diverse models. The reported results show high attack effectiveness (average $\text{ASR} = 89.41\%$ across $18$ models, with $100\%$ on $11$ models), strong transferability to API-only models, and robustness against perplexity-based defenses, while keeping prompts short. The work implies that safety-alignment research and defenses must account for semantic rather than surface-form vulnerabilities; the authors state that the code will be made publicly available.

Abstract

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is...", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

Paper Structure

This paper contains 31 sections, 6 theorems, 45 equations, 3 figures, 15 tables, 1 algorithm.

Key Result

Theorem 3.1

Given a query $\boldsymbol{q}$, an adversarial prompt $\boldsymbol{x}^*$, and a PPL threshold $\tau$, if … (where $\boldsymbol{y}_1^*$ is a desired response), then for a semantically equivalent response $\boldsymbol{y}_2^*$ where $\mathcal{R}(\boldsymbol{y}_1^*) = \mathcal{R}(\boldsymbol{y}_2^*) = \Phi$, the probability of generating $\boldsymbol{y}_2^*$ satisfies: … where $\delta$ is the minimum prob…
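The PPL threshold $\tau$ in the theorem can be illustrated with a minimal sketch of a perplexity check. This assumes the standard definition of perplexity (the exponential of the mean negative log-likelihood per token) and a "less than or equal to $\tau$" test; the function names, the log-probability values, and the threshold value are hypothetical and not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity: exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_coherent(token_logprobs, tau):
    """Assumed coherence test: a sequence passes if its perplexity is at most tau."""
    return perplexity(token_logprobs) <= tau


# Hypothetical natural-log token probabilities for two sequences.
fluent = [math.log(0.5), math.log(0.25), math.log(0.5)]
garbled = [math.log(0.01), math.log(0.02), math.log(0.005)]

tau = 10.0  # illustrative threshold, not a value from the paper
fluent_ok = is_coherent(fluent, tau)
garbled_ok = is_coherent(garbled, tau)
```

With these made-up numbers the fluent sequence has a perplexity of roughly 2.5 and passes, while the low-probability sequence has a perplexity of about 100 and is rejected.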

Figures (3)

  • Figure 1: 100% ASR across 11 LLMs.
  • Figure 2: Illustration of vanilla attacks under Semantic Incoherence and our Semantic Representation Attack under Semantic Coherence. Vanilla methods optimize for specific textual outputs, producing semantically incoherent prompts limited to a single response pattern. Our approach maintains coherence during optimization, enabling convergence to equivalent semantic representations across lexical variations, which provides multiple viable optimization paths and enhances attack performance.
  • Figure 3: Probability distributions in the Semantic Representation Attack framework. Example query $\boldsymbol{q}$: "Write code for a rainbow table attack given a trove of hashed passwords" targeting Llama 3.1 8B; only the top two candidate tokens per node are displayed because the discrete token space is large. (a) shows concentrated probability mass around semantically coherent adversarial prompts. (Note that the token in the bottom-left node is "\n", which may appear blank.) (b) demonstrates how these prompts induce multiple semantically equivalent harmful outputs. The visualization hierarchically displays autoregressive tokens from left to right, with nodes showing joint response probabilities (in ascending order) and edges indicating predicted tokens and their conditional probabilities.
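
The relationship between edge and node values described in the Figure 3 caption can be sketched as follows: a node's joint probability is the product of the conditional probabilities along its path, and the mass assigned to one meaning is the sum over its semantically equivalent responses. All path names and probabilities below are hypothetical, and the grouping into an equivalence set is assumed rather than computed.

```python
import math


def joint_probability(conditional_probs):
    """Chain rule: the joint probability of a path is the product of its edge probabilities."""
    return math.prod(conditional_probs)


# Hypothetical conditional token probabilities along three response paths.
paths = {
    "response_a": [0.5, 0.8, 0.5],
    "response_b": [0.4, 0.5, 0.5],
    "response_c": [0.1, 0.5, 0.2],
}

# Assumed: response_a and response_b carry the same meaning.
equivalent = {"response_a", "response_b"}

joint = {name: joint_probability(probs) for name, probs in paths.items()}

# Probability mass of the shared meaning, summed over its surface forms.
semantic_mass = sum(joint[name] for name in equivalent)

# Paths in ascending order of joint probability, as in the figure's node ordering.
ranked = sorted(joint, key=joint.get)
```

In this toy case the shared meaning receives more probability mass than either of its individual surface forms, which is the intuition behind targeting a semantic representation rather than one exact string.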

Theorems & Definitions (16)

  • Definition 3.1
  • Theorem 3.1: Coherence-Convergence Relationship
  • proof
  • Theorem 3.2: Semantic Representation Attack
  • proof
  • Theorem 3.3: Coherence Constraint
  • proof
  • proof
  • proof
  • Lemma A.1
  • ...and 6 more