TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, Ee-Chien Chang

TL;DR

TrapSuffix is a proactive defense against suffix-based jailbreaks that fine-tunes base LLMs with trap-aligned behaviors via low-rank adapters. It reshapes the adversarial suffix optimization landscape so that attacks either stall in deceptive, non-harmful local minima or succeed only with suffixes that carry distinctive, traceable fingerprints. The result is strong jailbreak suppression (average ASR < $0.01\%$) and high traceability (average TSR ≈ $87.9\%$) with no inference-time overhead and negligible memory cost (15.87 MB on average). The approach is validated across multiple open-source models and attack strategies, demonstrating robustness to adaptive attackers and preserving general utility on standard benchmarks. TrapSuffix is modular, plug-and-play, and compatible with existing alignment pipelines, offering a scalable path toward practical, auditable safety defenses in real-world deployments.

Abstract

Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs. Since suffixes are free-form text, they admit endlessly many surface forms, making jailbreak mitigation difficult. Most existing defenses depend on passive detection of suspicious suffixes, without leveraging the defender's inherent asymmetric ability to inject secrets and proactively conceal gaps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that nudges attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they can succeed only by generating adversarial suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01% and achieves an average tracing success rate of 87.9%, providing both strong defense and reliable traceability. It introduces no inference-time overhead and incurs negligible memory cost: it requires only 15.87 MB of additional memory on average, whereas state-of-the-art LLM-based detection defenses typically incur memory overheads on the order of $10^4$ MB. It also composes naturally with existing filtering-based defenses for complementary protection.
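
The two headline metrics can be made concrete with a small sketch. This section does not give the paper's exact definitions, so the code below assumes the simplest reading: attack success rate (ASR) is the fraction of attack attempts that produce an unsafe output, and tracing success rate (TSR) is the fraction of attack attempts whose suffix the defender flags as carrying the fingerprint. The `AttackRecord` schema and both function names are hypothetical and serve illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AttackRecord:
    """One jailbreak attempt (hypothetical schema, not taken from the paper)."""
    jailbroken: bool  # the model produced an unsafe output
    traced: bool      # the defender flagged the suffix as fingerprinted


def attack_success_rate(records: Sequence[AttackRecord]) -> float:
    """Fraction of attempts that yielded an unsafe output (assumed definition)."""
    if not records:
        raise ValueError("at least one record is required")
    return sum(r.jailbroken for r in records) / len(records)


def tracing_success_rate(records: Sequence[AttackRecord]) -> float:
    """Fraction of attempts whose suffix was flagged (assumed definition)."""
    if not records:
        raise ValueError("at least one record is required")
    return sum(r.traced for r in records) / len(records)


# Made-up outcomes for eight attempts; these are not results from the paper.
example = [AttackRecord(jailbroken=(i == 0), traced=(i != 7)) for i in range(8)]
print(attack_success_rate(example))   # 0.125
print(tracing_success_rate(example))  # 0.875
```

Under these assumed definitions, a defense of the kind described aims for the first number to be near zero and the second to be high at the same time.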

Paper Structure (46 sections, 20 equations, 11 figures, 11 tables, 1 algorithm)

Figures (11)

  • Figure 1: Illustration of adversarial suffix optimization under our TrapSuffix defense. An attacker optimizes an adversarial suffix by searching over the suffix space to minimize the adversarial loss. Our defense reshapes the adversarial optimization landscape, yielding two outcomes: attacks are either trapped in a rugged landscape with deceptive local minima, or succeed only by producing traceable suffix fingerprints.
  • Figure 2: FPR (↑) and ASR (↓) across three target models, comparing no defense and TrapSuffix.
  • Figure 3: Optimization trajectories of adversarial suffixes under adaptive optimization budgets, visualized in a shared PCA space with an interpolated loss landscape.
  • Figure 4: Distribution of traceability scores across GCG optimization steps for three models. Each violin plot shows the density distribution of traceability scores at different optimization steps (10, 210, 410). Red lines indicate mean values.
  • Figure 5: TrapSuffix-defended models consistently reshape gradient-based optimization behavior across attack settings.
  • ...and 6 more figures
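
The "deceptive local minima" idea in the Figure 1 caption can be illustrated with a toy local search. The sketch below is not the paper's method or a real suffix attack: the two loss arrays are made-up one-dimensional landscapes, and `greedy_descent` is a hypothetical stand-in for an attacker's optimizer that only ever moves to a neighboring position with strictly lower loss. It shows only that the same search reaches the global minimum on a smooth landscape but stalls early on a rugged one.

```python
from typing import Sequence


def greedy_descent(losses: Sequence[float], start: int) -> int:
    """Move to the lower-loss neighbor until no neighbor improves; return the final index."""
    pos = start
    while True:
        neighbors = [i for i in (pos - 1, pos + 1) if 0 <= i < len(losses)]
        best = min(neighbors, key=lambda i: losses[i])
        if losses[best] >= losses[pos]:
            return pos  # local minimum: no neighbor is strictly better
        pos = best


# Made-up landscapes over 12 candidate positions (illustrative values only).
smooth = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 2]  # loss falls steadily toward index 10
rugged = [9, 7, 8, 6, 4, 5, 7, 6, 3, 1, 0, 2]   # same global minimum, plus shallow dips

print(greedy_descent(smooth, start=0))  # 10: reaches the global minimum
print(greedy_descent(rugged, start=0))  # 1: trapped in the first shallow dip
```

Real suffix attacks search a far larger discrete token space and use stronger update rules, so this toy captures only the qualitative trapping effect, not the traceable-fingerprint outcome.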