Gauss-Newton Unlearning for the LLM Era

Lev McKinney; Anvith Thudi; Juhan Bae; Tara Rezaei; Nicolas Papernot; Sheila A. McIlraith; Roger Grosse

TL;DR

K-FADE is a second-order unlearning method for large language models that uses EK-FAC/K-FAC Hessian approximations to compute a small number of Gauss-Newton ascent steps on the forget set. By converting a constraint on the model's outputs over the retain distribution into a constraint on its weights, it suppresses outputs on the forget set while minimally disrupting retained behavior, and it approximates, in output space, the result of retraining without the forget set. Empirically, K-FADE achieves state-of-the-art performance on the WMDP and ToFU benchmarks, preserves specificity (lower KL divergence on retained data), and supports transferring unlearning updates to fine-tuned models, with runtime competitive with first-order methods. Limitations include the absence of formal unlearning guarantees and vulnerability to full-rank fine-tuning attacks, but the approach scales to frontier models and suggests practical directions for benchmarking and privacy-preserving unlearning.

Abstract

Standard large language model training can create models that produce outputs their trainer deems unacceptable in deployment. The probability of these outputs can be reduced using methods such as LLM unlearning. However, unlearning a set of data (called the forget set) can degrade model performance on other distributions where the trainer wants to retain the model's behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning approach for LLMs. While Gauss-Newton steps adapt Newton's method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FADE (K-FAC for Distribution Erasure). Our evaluation on the WMDP and ToFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set and approximates, in output space, the results of retraining without the forget set. Critically, our method does this while altering the outputs on the retain set less than previous methods. This is because K-FADE transforms a constraint on the model's outputs across the entire retain set into a constraint on the model's weights, allowing the algorithm to minimally change the model's behavior on the retain set at each step. Moreover, the unlearning updates computed by K-FADE can be reapplied later if the model undergoes further training, allowing unlearning to be cheaply maintained.
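
The step from an output-space constraint on the retain set to a weight-space update can be made concrete with the standard trust-region derivation below. This is a sketch under assumptions rather than the paper's exact formulation: the symbols $\mathcal{L}_f$ (forget-set loss), $\mathcal{D}_r$ (retain distribution), $\delta$ (weight update), $\epsilon$ (KL budget), $\lambda$ (damping), $J_x$ (Jacobian of the logits with respect to the weights), and the factors $A$ and $S$ are notation introduced here for illustration.

```latex
% Maximize the linearized forget loss subject to a KL budget on the retain set.
\begin{aligned}
\max_{\delta}\;& \nabla_\theta \mathcal{L}_f(\theta)^\top \delta
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D}_r}\!\left[
  \mathrm{KL}\!\left(\rho(\cdot \mid z_\theta(x)) \,\|\, \rho(\cdot \mid z_{\theta+\delta}(x))\right)
\right] \le \epsilon, \\[4pt]
% Second-order expansion of the KL term gives a quadratic form in weight space.
\mathbb{E}_{x \sim \mathcal{D}_r}\!\left[\mathrm{KL}(\cdot)\right]
  &\approx \tfrac{1}{2}\,\delta^\top G\,\delta,
\qquad
G = \mathbb{E}_{x \sim \mathcal{D}_r}\!\left[
  J_x^\top \left(\operatorname{diag}(\rho) - \rho\rho^\top\right) J_x
\right], \\[4pt]
% Solving the resulting quadratic program yields a (damped) Gauss-Newton ascent step.
\delta^\star &= \sqrt{\frac{2\epsilon}{g^\top (G + \lambda I)^{-1} g}}\;
  (G + \lambda I)^{-1} g,
\qquad g = \nabla_\theta \mathcal{L}_f(\theta), \\[4pt]
% K-FAC makes the inverse tractable for a layer with weight W (out x in):
% a = layer input, s = backpropagated pseudo-gradient at the layer output.
G_W &\approx A \otimes S,
\quad A = \mathbb{E}[a a^\top],\; S = \mathbb{E}[s s^\top],
\qquad
(A \otimes S)^{-1}\operatorname{vec}(\nabla_W) = \operatorname{vec}\!\left(S^{-1}\,\nabla_W\,A^{-1}\right).
\end{aligned}
```

Under these assumptions the curvature matrix $G$ is estimated on the retain distribution while the gradient $g$ comes from the forget set, so the step moves furthest in directions that raise the forget loss but barely change retain-set outputs. The Kronecker identity in the last line is what avoids ever forming or inverting the full matrix.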

Paper Structure (18 sections, 13 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Preliminaries
Our LLM Unlearning Threat Models and Notation
Problem Settings
Output Suppression and the Gauss-Newton Step
Approximate Retraining and the Gauss-Newton Step
Efficient Second-Order Approximations
How EK-FAC's Compute Requirements Scale
Methods: how to implement your Gauss-Newton step
Transferring Unlearning Updates to Finetuned Models
Experiments
Can K-FADE suppress "harmful" knowledge while maintaining specificity?
Can K-FADE Approximate Retraining Outputs?
What Makes a Single Step of K-FADE Effective?
Can K-FADE Reestablish Unlearning After Fine-Tuning?
...and 3 more sections

Figures (6)

Figure 1: How Gauss-Newton ascent achieves output suppression while maintaining performance on a retain set. Here we show how Gauss-Newton ascent relaxes a constraint satisfaction problem for unlearning in the output space of a model (represented by the logits $z$) and converts it into a simple (though very high dimensional) linear algebra problem in weight space (represented by $\theta$ variables). Here $\rho(\cdot|z)$ denotes the distribution of tokens (i.e., labels) specified by the model output logits. This allows us to simultaneously satisfy a constraint on the model's output across millions of tokens at every step of the algorithm. See Section \ref{sec:natgrad} for more details.
Figure 2: K-FADE changes the model's behavior on unrelated data less than strong baselines like ELM \cite{elm2024} and RMU \cite{li_wmdp_2024}. The y-axis plots one minus the cumulative distribution function of the KL divergences (x-axis) between completions (from prompts in the Alpaca dataset) generated by the original model (zephyr-7b-$\beta$) and by models unlearned with ELM, RMU, and K-FADE on the WMDP Bio and Cyber subsets. Shaded regions show the 95% bootstrap confidence interval on the quantiles.
Figure 3: One step of K-FADE outperforms the state of the art in unlearning on the TOFU dataset. The left figure plots Forget Quality (similarity of outputs to the outputs of the ideal retrained model, defined in Section \ref{sec:forget_quality}) as a function of Model Utility (degradation of model performance, defined in Section \ref{sec:forget_quality}) when unlearning 5% of the author bios. The right figure plots the same comparison when unlearning 10% of the author bios. K-FADE outperforms both the baseline methods provided in the original TOFU paper \cite{maini_tofu_2024} and a recent state-of-the-art method, SimNPO \cite{fan2024simplicity}. The star represents the results of retraining a model on only the retain set.
Figure 4: Parametric second-order methods can efficiently trade off specificity for speed. We evaluate several variations of our method using Phi 1.5 \cite{textbooks2} on the TOFU benchmark under the 10% forget setting. On the left we compare Forget Quality (which measures similarity of model outputs to a retrained model, see Section \ref{sec:forget_quality}) across varying keep (retain) set KL divergences for each of the methods. On the right we compare the computational time of each method. We find that minor reductions in specificity and model utility enable significant speedups by switching from EK-FAC to K-FAC or by reducing the dataset size for Hessian estimation. Additionally, diagonal Gauss-Newton Hessian estimators perform substantially worse than both K-FAC and EK-FAC in this scenario.
Figure 5: K-FADE updates can be re-applied after fine-tuning to maintain the unlearning effect. We measure the accuracy on the WMDP forget set (y-axis) after fine-tuning the unlearned model on different datasets (x-axis); lower accuracy is better. Like past methods, K-FADE is not resistant to full-rank fine-tuning. However, we find that the update directions can be re-applied after fine-tuning, preserving the unlearning effect, as shown by the dashed bars. This transfer process works significantly better with K-FADE than with the baselines.
...and 1 more figure
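
The specificity measurement described in the Figure 2 caption (one minus the empirical CDF of per-prompt KL divergences, with bootstrap confidence intervals on the quantiles) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the toy three-token distributions, and the percentile-bootstrap choice are all assumptions made here.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def survival(values, threshold):
    """Empirical 1 - CDF: fraction of values strictly greater than threshold."""
    return sum(v > threshold for v in values) / len(values)


def bootstrap_quantile_ci(values, q, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the q-th quantile."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Resample with replacement and record the quantile of the resample.
        resample = sorted(rng.choice(values) for _ in range(n))
        estimates.append(resample[min(n - 1, int(q * n))])
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical next-token distributions for four prompts (toy data only).
original = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
unlearned = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]

kls = [kl_divergence(p, q) for p, q in zip(original, unlearned)]
frac_changed = survival(kls, 0.0)          # share of prompts with any change
ci_low, ci_high = bootstrap_quantile_ci(kls, q=0.5)
```

A curve that drops toward zero at small KL values indicates that few prompts saw their outputs change, which is the sense in which a method is more specific.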
