Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra

TL;DR

This work relaxes the deterministic $k$-unstable certificate of SmoothLLM by introducing a probabilistic $(k, \varepsilon)$-unstable framework, motivated by empirical evidence that attack success rates decay exponentially rather than vanish abruptly. It derives data-informed lower bounds on the defense success probability (DSP) under two perturbation schemes, RandomSwapPerturbation and RandomPatchPerturbation, and shows how to calibrate practical security thresholds through an end-to-end case study that maps risk tolerance to concrete parameters $(k, N)$. The approach yields actionable, model- and threat-specific certificates that reflect real-world attack behavior while preserving formalism, enabling practitioners to balance safety guarantees with deployment costs. By providing a systematic workflow for parameter selection and demonstrating robustness across gradient-based and semantic jailbreaking attacks, the framework advances reliable, scalable deployment of LLM safety mechanisms in practical settings.

Abstract

The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "$k$-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "$(k, \varepsilon)$-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of $(k, \varepsilon)$-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.

Paper Structure

This paper contains 28 sections, 2 theorems, 24 equations, 10 figures.

Key Result

Proposition 1

Let $\mathcal{A}$ denote an alphabet of size $v$ and let $P = [G; S] \in \mathcal{A}^m$ denote an input prompt to a given LLM where $G \in \mathcal{A}^{m_G}$ and $S \in \mathcal{A}^{m_S}$ with $m = m_G + m_S$. Let $M = \lfloor qm \rfloor$ denote the number of characters perturbed. Assume that $S$ is where $\alpha$ is bounded below by:

Figures (10)

  • Figure 1: Attack success rate on Llama2 as a function of the number of perturbed characters $k$ using RandomPatchPerturbation and GCG attack.
  • Figure 2: Attack success rate on Llama2 as a function of the number of perturbed characters $k$ using RandomSwapPerturbation and GCG attack.
  • Figure 3: Pipeline for tuning SmoothLLM for obtaining a certified defense success probability (DSP) given the model and attack type. Detailed analysis in Section 3.7.
  • Figure 4: Certified Defense Success Probability (DSP) versus the number of samples $N$ with RandomSwapPerturbation, using the fitted attack success rate model. We assume a total prompt length of $m = 240$ characters, an adversarial suffix length of $m_S = 100$, perturbation rate $q = 0.10$, threshold $k = 8$, and $\varepsilon = 0.05$.
  • Figure 5: Attack success rate on Vicuna as a function of the number of perturbed characters $k$ using RandomPatchPerturbation and GCG attack.
  • ...and 5 more figures
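
The kind of curve described in the Figure 4 caption (certified DSP versus the number of samples $N$) can be sketched as follows. This is a minimal illustration and not the paper's exact bound: it assumes that a perturbed copy is defended with probability at least $1 - \varepsilon$ whenever at least $k$ suffix characters are actually changed, uses a hypergeometric count of swapped positions that land in the suffix, and aggregates with a strict majority vote over $N$ independent copies. The function names `alpha_lower_bound` and `dsp`, the alphabet size `v = 100`, and the simplified $(1 - \varepsilon)$ factor are assumptions made for illustration; the paper's certificate uses a fitted attack success rate model instead.

```python
from math import comb, floor


def alpha_lower_bound(m, m_s, q, k, eps, v=100):
    """Hypothetical per-copy defense probability under random swaps.

    Assumption (not taken from the paper): a copy is defended with
    probability >= 1 - eps once at least k suffix characters change.
    """
    M = floor(q * m)          # number of characters swapped in the prompt
    total = comb(m, M)        # ways to choose the swapped positions
    p_at_least_k = 0.0
    for i in range(k, min(M, m_s) + 1):
        # Probability that exactly i swapped positions land in the suffix.
        hyper = comb(m_s, i) * comb(m - m_s, M - i) / total
        # Probability that at least k of those i swaps change the character
        # (a uniform swap keeps the original character with probability 1/v).
        changed = sum(
            comb(i, l) * ((v - 1) / v) ** l * (1 / v) ** (i - l)
            for l in range(k, i + 1)
        )
        p_at_least_k += hyper * changed
    return (1 - eps) * p_at_least_k


def dsp(alpha, n):
    """Probability that a strict majority of n independent copies is defended."""
    t_min = n // 2 + 1
    return sum(
        comb(n, t) * alpha ** t * (1 - alpha) ** (n - t)
        for t in range(t_min, n + 1)
    )


if __name__ == "__main__":
    # Settings quoted in the Figure 4 caption.
    a = alpha_lower_bound(m=240, m_s=100, q=0.10, k=8, eps=0.05)
    for n in (1, 3, 5, 7, 9, 11):
        print(n, round(dsp(a, n), 4))
```

Under these assumptions, the majority-vote DSP increases with odd $N$ whenever the per-copy probability exceeds $1/2$, which is the trade-off between safety guarantee and deployment cost that the parameter-selection workflow navigates.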

Theorems & Definitions (12)

  • Definition 1: SmoothLLM
  • Definition 2: $k$-unstable
  • Definition 3
  • Proposition 1: $(k, \varepsilon)$-Unstable Certificate for RandomSwapPerturbation
  • Proof: Proof Sketch
  • Remark 1: Tightness of the bound
  • Remark 2: Relation to the Original Certificate
  • Proposition 2: $(k, \varepsilon)$-Unstable Certificate for RandomPatchPerturbation
  • Proof: Proof Sketch
  • Remark 3: Practical Interpretation
  • ...and 2 more