Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati; Xijie Zeng; Hong Huang; Sebastian Dionicio; Subhabrata Majumdar; Frank Rudzicz; Hassan Sajjad

TL;DR

In adversarial settings, this work establishes a fundamental limit on a broad class of convergence-rate control methods, including its own: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size.

Abstract

Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance method provides theoretical guarantees. Treating these interventions as convergence-rate control problems allows us to connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a novel understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.

Paper Structure (105 sections, 32 theorems, 204 equations, 12 figures, 29 tables, 1 algorithm)

Introduction
Convergence Rate is Determined By Spectral Values
Iteration Complexity Bounds
Controlling Convergence Rates
Bad local minima
Weight distance
Increasing curvature
A Hessian Spectral Lower Bound Dependent on Weight Spectrum
Spectral Reparameterization for Convergence Control
Experimental Validation
Setup
Comparison with baselines
Curvature-aware methods
Fundamental Limits of Convergence-Rate-Based Resistance
Experiments
...and 90 more sections

Key Result

Proposition 1

For gradient descent on an $L$-smooth loss function $\mathcal{L}$, the number of iterations $k$ needed to achieve $\underset{0 \leq i \leq k-1}{\min} \|\nabla \mathcal{L}_{\theta_i}\| \le \epsilon$ is $k \ge ( L\|\theta_0 - \theta_*\| )^2/\epsilon^2$, where $\theta_0$ is the initial model parameter
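To make the scaling in Proposition 1 concrete, the following minimal sketch evaluates the expression $( L\|\theta_0 - \theta_*\| )^2/\epsilon^2$ exactly as quoted above. The function name `iteration_bound` and all numeric values are illustrative assumptions, not taken from the paper.

```python
def iteration_bound(smoothness, distance, eps):
    """Evaluate (L * ||theta_0 - theta_*||)^2 / eps^2 from Proposition 1.

    smoothness: the smoothness constant L (hypothetical value supplied by the caller)
    distance:   the norm ||theta_0 - theta_*|| (hypothetical value)
    eps:        the target gradient-norm tolerance epsilon
    """
    if smoothness <= 0 or distance < 0 or eps <= 0:
        raise ValueError("expected smoothness > 0, distance >= 0, eps > 0")
    return (smoothness * distance) ** 2 / eps ** 2


if __name__ == "__main__":
    # Illustrative values only: a 10x larger L gives a 100x larger bound.
    base = iteration_bound(smoothness=2.0, distance=3.0, eps=0.5)
    scaled = iteration_bound(smoothness=20.0, distance=3.0, eps=0.5)
    print(base, scaled, scaled / base)
```

Because the expression is quadratic in $L$, any intervention that raises the smoothness constant raises this iteration count quadratically, which is the lever that convergence-rate control relies on.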

Figures (12)

Figure 1: Convergence Rate Control: $\sigma_1(H^{\mathcal{L}}_{\theta})$ and the number of iterations needed to converge both increase with $\sigma_1(\theta_i)$
Figure 2: Style transfer attack: SpecDef maintains low similarity to Van Gogh references throughout 50 epochs of LoRA fine-tuning, while ESD and IMMA recover the erased style within 10 epochs. Lower similarity indicates stronger resistance.
Figure 3: Steps needed for optimizing a 2D convex quadratic function depend on Hessian spectral values, $\sigma_1, \sigma_2$. Maximum convergent learning rate is used and $\sigma_{2}$ is fixed at 1.
Figure 4: Illustration of the principal angle lower bound
Figure 5: Qualitative NSFW results under an I2P prompt using the same random seed. We compare the images generated by spectral-deformation-protected ESD with the generations from Vanilla SD v1-4, nudity-erased SD (ESD), unprotected ESD after LoRA adaptation, and IMMA-protected ESD after LoRA adaptation. We further vary the multiplier applied to the largest singular values ($\alpha \in \{100, 1000, 10000\}$) and the layer selection strategy (random 1, 5, 25 layers, or all cross-attention layers).
...and 7 more figures
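The relationship described in the Figure 3 caption can be reproduced in spirit with a small experiment. The sketch below runs gradient descent on the 2D convex quadratic $f(x) = \tfrac{1}{2}(\sigma_1 x_1^2 + \sigma_2 x_2^2)$, whose Hessian has singular values $\sigma_1, \sigma_2$, with $\sigma_2$ fixed at 1. It is an illustration under stated assumptions, not the paper's experimental code: the function name, the starting point $(1, 1)$, the stopping tolerance, and the `safety` factor (a step size of $0.95 \cdot 2/\sigma_1$, just below the divergence threshold $2/\sigma_1$) are all choices made here.

```python
import math


def gd_steps_on_quadratic(sigma1, sigma2=1.0, safety=0.95, tol=1e-6,
                          max_steps=1_000_000):
    """Count gradient-descent steps on f(x) = 0.5*(sigma1*x1^2 + sigma2*x2^2).

    The step size is safety * 2 / sigma1, i.e. slightly below the largest
    step size for which gradient descent on this quadratic still converges.
    Returns the number of updates performed before the gradient norm <= tol.
    """
    if sigma2 <= 0 or sigma1 < sigma2:
        raise ValueError("expected sigma1 >= sigma2 > 0")
    lr = safety * 2.0 / sigma1
    x1, x2 = 1.0, 1.0  # assumed starting point
    for step in range(max_steps):
        g1, g2 = sigma1 * x1, sigma2 * x2  # gradient of the quadratic
        if math.hypot(g1, g2) <= tol:
            return step
        x1 -= lr * g1
        x2 -= lr * g2
    return max_steps


if __name__ == "__main__":
    for s1 in (1.0, 10.0, 100.0, 1000.0):
        print(f"sigma1={s1:g}: {gd_steps_on_quadratic(s1)} steps")
```

As $\sigma_1$ grows, the admissible step size shrinks, so progress along the low-curvature direction ($\sigma_2 = 1$) slows and the step count rises.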

Theorems & Definitions (74)

Proposition 1: $L$-smooth gradient descent iteration complexity
Proposition 2: Nesterov iteration complexity
Theorem 3: Hessian Singular Value Lower Bound
Proof Sketch (Theorem 3)
Example 4: Three-layer MLP
Definition 5: Lower-Max Spectral Reparameterization
Theorem 6: SpecDef is a (Lower-Max) Spectral Reparameterization
Theorem 7: Only Weight Matrices Provide Unbounded Spectral Control
Proof Sketch (Theorem 7)
Theorem 8: Cost of Undoing Spectral Reparameterization
...and 64 more
