Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, we analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We derive these behaviors analytically and confirm them empirically on large language models. This transition between two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.
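The contrast between polynomial and exponential growth in the number of samples $k$ can be illustrated with a toy calculation. This is not the paper's spin-glass model. It is a minimal sketch that assumes each prompt has a per-sample success probability `p` and that the $k$ samples are independent. Under that assumption, a fixed `p` gives a failure probability $(1-p)^k$ that decays exponentially in $k$, while a `p` spread over prompts with weight near zero (here a hypothetical Beta distribution with parameters `a`, `b`) gives a failure probability that decays only as a power law, roughly $k^{-a}$. All function names and parameter values below are illustrative.

```python
import math

def failure_fixed(p, k):
    """Probability that all k independent samples fail when every prompt has the same p."""
    return (1.0 - p) ** k

def failure_beta(a, b, k):
    """E[(1 - p)^k] for p ~ Beta(a, b), which equals B(a, b + k) / B(a, b)."""
    log_val = (math.lgamma(b + k) + math.lgamma(a + b)
               - math.lgamma(a + b + k) - math.lgamma(b))
    return math.exp(log_val)

def loglog_slope(f, k1, k2):
    """Slope of log f(k) against log k between k1 and k2."""
    return (math.log(f(k2)) - math.log(f(k1))) / (math.log(k2) - math.log(k1))

# Assumed toy parameters (not taken from the paper).
a, b, p = 0.5, 1.0, 0.05

# Heterogeneous prompts: the log-log slope approaches -a, i.e. power-law decay.
power_law_slope = loglog_slope(lambda k: failure_beta(a, b, k), 1000, 2000)

# Homogeneous prompts: log failure is linear in k, i.e. exponential decay.
exp_rate = (math.log(failure_fixed(p, 20)) - math.log(failure_fixed(p, 10))) / 10

for k in (1, 10, 100, 1000):
    asr_mixed = 1.0 - failure_beta(a, b, k)
    asr_fixed = 1.0 - failure_fixed(p, k)
    print(f"k={k:5d}  ASR (Beta mixture)={asr_mixed:.4f}  ASR (fixed p)={asr_fixed:.4f}")
```

In this sketch the attack success rate is one minus the failure probability, so the fixed-`p` case saturates much faster at large $k$ than the mixture case.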

Paper Structure (17 sections, 22 equations, 6 figures)


Figures (6)

  • Figure 1: Attack success rate $\Pi_k$ plotted against the number of inference-time samples $k$. The experiment is performed with Mistral-7B-Instruct-v0.3 acting as the judge of jailbreaking on the walledai/AdvBench dataset, using the attack method of zou2023universal.
  • Figure 2: A schematic view of the low-energy landscape of a large number of spins interacting via the spin-glass Hamiltonian. In the low-temperature replica-symmetry-breaking phase, the Gibbs measure decomposes into many pure states/clusters that are hierarchically organized by their overlaps; it is common to picture these states as ‘valleys’ or ‘basins’ in an energy landscape. Following PhysRevE.79.051117, zhou2011random, we approximate the overlap-based clustering in the replica-symmetry-breaking phase by the basins of attraction associated with several local minima. The figure shows what a typical sample of the size-biased ordering would look like; dots signify individual spin configurations.
  • Figure 3: Sample two configurations $\sigma, \tau$ independently from the Gibbs measure. $R(\sigma,\tau)$ concentrates at $q_{(\sigma, \tau)}$, where $(\sigma, \tau)$ is the first level at which they differ. The distance $d(\sigma,\tau)=(1-R(\sigma,\tau))/2$ is ultrametric: $R(\alpha_1, \alpha_3)\geq \min (R(\alpha_1, \alpha_2), R(\alpha_2, \alpha_3))$.
  • Figure 4: Attack success rate in the spin-glass-based model, computed numerically for $N=24, p=2, \beta=10, j_0=1$. The teacher-student setup is matched for $\beta, j_0$, with an additional magnetic field $h$ turned on for the student along the $m=1$ teacher cluster at the lowest level. On the left, we compare the numerical plot against the prediction of Theorem \ref{thm:low_field}, i.e., $\log(-\log(\Pi_k))=-\nu \log k-\nu \lambda+\log C_{m}$, and see that for small $\lambda$, or equivalently small $h$, the graphs agree well in the domain of validity of the theoretical result, $N \gg k \gg 1 \gg k\lambda^2$. As we increase $h$, violating $k \lambda^2 \ll 1$, the experimental results differ significantly from the prediction of Theorem \ref{thm:low_field}; in this domain it is meaningful to fit the experimental results to a form suggested by Theorem \ref{thm:low_field} and Theorem \ref{thm:high_field}, i.e., $\log(-\log(\Pi_k))=-\hat{\nu} \log k-\hat{\mu} k+\log \hat{C}$. From the plot on the right, we see that a reasonable fit to this form exists in the large-$h$ regime, in which both $\hat{\mu}$ and $\hat{\nu}$ increase monotonically with $h$.
  • Figure 5: Attack success rate measurement based on refusal string and GPT-4 as LLM-judge (left) and analysis of harmfulness of jailbroken response for various attack methods (right).
  • ...and 1 more figures
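
The fitting form quoted in the Figure 4 caption, $\log(-\log(\Pi_k))=-\hat{\nu} \log k-\hat{\mu} k+\log \hat{C}$, is linear in the unknowns $\hat{\nu}$, $\hat{\mu}$, and $\log \hat{C}$, so one way to estimate them is ordinary least squares. The sketch below assumes that approach; the source does not state which fitting procedure the authors used. The function name `fit_crossover` and the synthetic parameter values are hypothetical and serve only to check that the fit recovers known inputs.

```python
import numpy as np

def fit_crossover(k, asr):
    """Least-squares fit of log(-log(asr)) = -nu*log(k) - mu*k + log(C).

    Requires 0 < asr < 1 so that the double logarithm is defined.
    Returns (nu_hat, mu_hat, C_hat).
    """
    k = np.asarray(k, dtype=float)
    asr = np.asarray(asr, dtype=float)
    y = np.log(-np.log(asr))
    # Columns multiply nu, mu, and log(C) respectively.
    design = np.column_stack([-np.log(k), -k, np.ones_like(k)])
    (nu, mu, log_c), *_ = np.linalg.lstsq(design, y, rcond=None)
    return nu, mu, np.exp(log_c)

# Synthetic, noise-free data generated from assumed parameters (not from the paper).
true_nu, true_mu, true_c = 0.6, 0.05, 2.0
k = np.arange(1, 41, dtype=float)
asr = np.exp(-true_c * k ** (-true_nu) * np.exp(-true_mu * k))

nu_hat, mu_hat, c_hat = fit_crossover(k, asr)
print(f"nu_hat={nu_hat:.4f}  mu_hat={mu_hat:.4f}  C_hat={c_hat:.4f}")
```

With $\hat{\mu}=0$ the form reduces to a pure power law in $k$, while a dominant $\hat{\mu} k$ term corresponds to exponential behavior. On real measurements the attack success rate can reach exactly 0 or 1 at some $k$, and those points would need to be excluded before taking the double logarithm.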

Theorems & Definitions (5)

  • proof
  • Remark 1
  • Remark 2
  • proof
  • Remark 3