Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati, Xijie Zeng, Hong Huang, Sebastian Dionicio, Subhabrata Majumdar, Frank Rudzicz, Hassan Sajjad

TL;DR

In adversarial settings, this work establishes a fundamental limit on a broad class of convergence-rate control methods, including its own: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size.

Abstract

Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance method provides theoretical guarantees. Treating these interventions as convergence-rate control problems allows us to connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a novel understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that both provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.
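As a toy illustration of how a function-preserving reparameterization can change the loss Hessian's spectrum, the sketch below uses a scalar two-layer linear model with loss $0.5\,(a b x - y)^2$. This is not SpecDef; the model, the rescaling $a \to \alpha a$, $b \to b/\alpha$, and all names and values here are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def hessian_two_layer_scalar(a, b, x, y):
    """Hessian of f(a, b) = 0.5 * (a*b*x - y)**2 with respect to (a, b)."""
    r = a * b * x - y                      # residual
    off_diag = r * x + a * b * x**2        # d^2 f / (da db)
    return np.array([[(b * x) ** 2, off_diag],
                     [off_diag, (a * x) ** 2]])

# Hypothetical toy values (assumptions, not taken from the paper).
a, b, x, y = 1.0, 1.0, 1.0, 0.5

outputs = []
top_singular_values = []
for alpha in (1.0, 10.0, 100.0):
    a_scaled, b_scaled = alpha * a, b / alpha    # the product a*b is unchanged
    outputs.append(a_scaled * b_scaled * x)      # model output stays the same
    hessian = hessian_two_layer_scalar(a_scaled, b_scaled, x, y)
    # ord=2 gives the largest singular value of the 2x2 Hessian
    top_singular_values.append(np.linalg.norm(hessian, 2))

for alpha, out, s in zip((1.0, 10.0, 100.0), outputs, top_singular_values):
    print(f"alpha={alpha:g}  output={out:.3f}  top Hessian singular value={s:.3f}")
```

In this toy case the model output is identical for every $\alpha$, while the largest Hessian singular value grows with the larger weight, which is the kind of coupling between weight scale and curvature that the abstract describes.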

Paper Structure

This paper contains 105 sections, 32 theorems, 204 equations, 12 figures, 29 tables, 1 algorithm.

Key Result

Proposition 1

For gradient descent on an $L$-smooth loss function $\mathcal{L}$, the number of iterations $k$ needed to achieve $\underset{0 \leq i \leq k-1}{\min} \|\nabla \mathcal{L}_{\theta_i}\| \le \epsilon$ is $k \ge ( L\|\theta_0 - \theta_*\| )^2/\epsilon^2$, where $\theta_0$ is the initial model parameter
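A minimal numerical check of this bound, under assumptions that are not in the source: the loss is a convex quadratic $0.5\,\theta^\top H \theta$ (so $\theta_* = 0$ and $L$ is the largest eigenvalue of $H$), the step size is $1/L$, and the bound is read as an iteration count that suffices to reach the target gradient norm. The matrix, starting point, and tolerance are hypothetical.

```python
import numpy as np

def gd_iterations_to_eps(hessian, theta0, eps, max_iters=1_000_000):
    """Gradient descent with step 1/L on f(theta) = 0.5 * theta^T H theta.

    Returns the smallest k such that min_{0 <= i <= k-1} ||grad f(theta_i)|| <= eps.
    """
    smoothness = np.linalg.eigvalsh(hessian).max()   # L for a quadratic
    theta = np.asarray(theta0, dtype=float).copy()
    for i in range(max_iters):
        grad = hessian @ theta
        if np.linalg.norm(grad) <= eps:
            return i + 1          # iterates theta_0 .. theta_i were evaluated
        theta = theta - grad / smoothness
    raise RuntimeError("did not reach eps within max_iters")

# Hypothetical problem instance (assumption, for illustration only).
H = np.diag([10.0, 1.0])
theta0 = np.array([1.0, 1.0])
eps = 1e-3

L = np.linalg.eigvalsh(H).max()
k_observed = gd_iterations_to_eps(H, theta0, eps)
k_bound = (L * np.linalg.norm(theta0 - 0.0)) ** 2 / eps ** 2   # theta_* = 0

print(f"observed iterations: {k_observed}, bound from Proposition 1: {k_bound:.3g}")
```

On a well-behaved quadratic like this one, the observed count is far below the bound; the bound is a worst-case statement over all $L$-smooth losses.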

Figures (12)

  • Figure 1: Convergence Rate Control: $\sigma_1(H^{\mathcal{L}}_{\theta})$ and the convergence rate (iterations) both increase with $\sigma_1(\theta_i)$
  • Figure 2: Style transfer attack: SpecDef maintains low similarity to Van Gogh references throughout 50 epochs of LoRA fine-tuning, while ESD and IMMA recover the erased style within 10 epochs. Lower similarity indicates stronger resistance.
  • Figure 3: The number of steps needed to optimize a 2D convex quadratic function depends on the Hessian singular values $\sigma_1, \sigma_2$. The maximum convergent learning rate is used and $\sigma_{2}$ is fixed at 1.
  • Figure 4: Illustration of the principal angle lower bound
  • Figure 5: Qualitative NSFW results under an I2P prompt using the same random seed. We compare the images generated by spectral-deformation-protected ESD with the generations from Vanilla SD v1-4, nudity-erased SD (ESD), unprotected ESD after LoRA adaptation, and IMMA-protected ESD after LoRA adaptation. We further vary the multiplier applied to the largest singular values ($\alpha \in \{100, 1000, 10000\}$) and the layer selection strategy (random 1, 5, 25 layers, or all cross-attention layers).
  • ...and 7 more figures
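The effect described in the Figure 3 caption can be reproduced in a small sketch. The assumptions here are not from the source: the quadratic is $0.5\,(\sigma_1 \theta_1^2 + \sigma_2 \theta_2^2)$, the learning rate is taken as $1/\sigma_1$ (one common choice that stays within the convergent range $< 2/\sigma_1$; the paper's exact "maximum convergent learning rate" may differ), and the starting point and tolerance are hypothetical.

```python
import numpy as np

def steps_to_converge(sigma1, sigma2=1.0, tol=1e-6, max_steps=10_000_000):
    """Count gradient-descent steps on 0.5*(sigma1*t1^2 + sigma2*t2^2).

    Stops when the gradient norm drops to tol or below.
    """
    hessian_diag = np.array([sigma1, sigma2])
    lr = 1.0 / sigma1                 # assumption: step size set by the largest curvature
    theta = np.array([1.0, 1.0])      # hypothetical starting point
    for step in range(max_steps):
        grad = hessian_diag * theta   # gradient of a diagonal quadratic
        if np.linalg.norm(grad) <= tol:
            return step
        theta = theta - lr * grad
    raise RuntimeError("did not converge within max_steps")

# sigma_2 is fixed at 1, as in the caption; sigma_1 is varied.
steps = {sigma1: steps_to_converge(sigma1) for sigma1 in (1.0, 10.0, 100.0)}
for sigma1, n in steps.items():
    print(f"sigma_1={sigma1:g}  steps={n}")
```

Because the step size is capped by $\sigma_1$, the low-curvature direction contracts by a factor of $1 - \sigma_2/\sigma_1$ per step, so the step count grows roughly in proportion to $\sigma_1/\sigma_2$.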

Theorems & Definitions (74)

  • Proposition 1: $L$-smooth gradient descent iteration complexity
  • Proposition 2: Nesterov iteration complexity
  • Theorem 3: Hessian Singular Value Lower Bound
  • Proof: Proof Sketch
  • Example 4: Three-layer MLP
  • Definition 5: Lower-Max Spectral Reparameterization
  • Theorem 6: SpecDef is a (Lower-Max) Spectral Reparameterization
  • Theorem 7: Only Weight Matrices Provide Unbounded Spectral Control
  • Proof: Proof Sketch
  • Theorem 8: Cost of Undoing Spectral Reparameterization
  • ...and 64 more