Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan; Ayushi Mehrotra

TL;DR

This work relaxes the deterministic $k$-unstable certificate of SmoothLLM by introducing a probabilistic $(k, \varepsilon)$-unstable framework, motivated by empirical evidence that attack success rates decay exponentially rather than vanish abruptly. It derives data-informed lower bounds on the defense probability $DSP$ under two perturbation schemes, RandomSwapPerturbation and RandomPatchPerturbation, and shows how to calibrate practical security thresholds through an end-to-end case study that maps risk tolerance to concrete parameters $(k, N)$. The approach yields actionable, model- and threat-specific certificates that reflect real-world attack behavior while remaining formal guarantees, so practitioners can balance safety against deployment cost. By providing a systematic workflow for parameter selection and demonstrating robustness across gradient-based and semantic jailbreaking attacks, the framework supports reliable, scalable deployment of LLM safety mechanisms.

Abstract

The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "$k$-unstable" assumption that rarely holds in practice. This strong assumption limits the trustworthiness of the resulting safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "$(k, \varepsilon)$-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. The $(k, \varepsilon)$-unstable notion gives practitioners actionable safety guarantees and lets them set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.
