Temper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance

Jacob L. Block; Mehryar Mohri; Aryan Mokhtari; Sanjay Shakkottai

TL;DR

This work reframes unlearning in generative models as distribution correction via density-ratio estimation, modeling the data as a mixture $p(z)=(1-\gamma)p_r(z)+\gamma p_f(z)$ of a retain distribution $p_r$ and a forget distribution $p_f$. It shows that classical classifier-guided tilting can leak forget-set information when $p_f$ is sharply concentrated, and introduces Temper-Then-Tilt Unlearning (T3-Unlearning), which first tempers the base distribution to $p(z)^{1/T}$ and then tilts it with a learned classifier $\hat f$, yielding $\hat p^{(T)}_r(z) \propto p(z)^{1/T} \hat f(z)$. The paper proves finite-sample guarantees linking the classifier's excess risk $\delta$ to the Retain and Forget Errors, showing that tempering reduces the Forget-Error scaling from linear in $\|p_f\|_\infty$ to $\|p_f\|_\infty^{1/T}$ plus a tempering bias, at the cost of an additional tempering bias in the Retain Error. Empirically, on the TOFU benchmark with Llama 3.1 8B, T3-Unlearning outperforms baselines in forget quality while maintaining high generative utility, and does so with a tiny fraction of the trainable parameters and faster runtimes. Overall, the approach provides principled, efficient unlearning with provable guarantees and strong practical performance for large language models.
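To make the density-ratio view concrete, the following short derivation shows why tilting the base distribution by a retain-versus-forget classifier targets $p_r$. It is a sketch under one assumption: $f^\ast(z)$ is read here as the Bayes-optimal posterior probability that a sample $z$ drawn from the mixture came from the retain component, which may differ in detail from the paper's own definition.

$$
f^\ast(z) \;=\; \frac{(1-\gamma)\,p_r(z)}{(1-\gamma)\,p_r(z)+\gamma\,p_f(z)} \;=\; \frac{(1-\gamma)\,p_r(z)}{p(z)}
\quad\Longrightarrow\quad
p_r(z) \;=\; \frac{p(z)\,f^\ast(z)}{1-\gamma} \;\propto\; p(z)\,f^\ast(z).
$$

Replacing $f^\ast$ with a learned $\hat f$ gives the untempered estimate ($T=1$), and raising $p(z)$ to the power $1/T$ before tilting gives the tempered estimate $\hat p^{(T)}_r(z) \propto p(z)^{1/T}\hat f(z)$ stated above.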

Abstract

We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution rather than supervised fine-tuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i) tempering the base distribution to flatten high-confidence spikes, and (ii) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier's risk to unlearning error, proving that tempering is necessary to successfully unlearn for concentrated distributions. Empirical evaluations on the TOFU benchmark show that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a fraction of the parameters with a minimal runtime.
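The two-step procedure can be illustrated on a small discrete distribution. The sketch below is not the authors' implementation: the function name `temper_then_tilt`, the toy mixture, and the hand-made imperfect classifier `f_hat` are all hypothetical, chosen only to show how tempering damps the leakage from a concentrated forget component when the classifier is slightly wrong.

```python
import numpy as np

def temper_then_tilt(p, f_hat, T=1.0):
    """Normalized estimate proportional to p**(1/T) * f_hat (discrete case)."""
    p = np.asarray(p, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    unnorm = p ** (1.0 / T) * f_hat
    return unnorm / unnorm.sum()

# Toy mixture over 100 outcomes (illustrative numbers, not from the paper):
# the forget component is a spike on outcome 0, the retain component is
# uniform over the remaining 99 outcomes.
gamma = 0.5
p_r = np.r_[0.0, np.full(99, 1.0 / 99)]
p_f = np.r_[1.0, np.zeros(99)]
p = (1 - gamma) * p_r + gamma * p_f

# Bayes-optimal retain posterior; tilting with it at T=1 recovers p_r exactly.
f_star = (1 - gamma) * p_r / p

# Imperfect classifier: leaves a small retain probability on the spike.
f_hat = f_star.copy()
f_hat[0] = 0.05

leak_untempered = temper_then_tilt(p, f_hat, T=1.0)[0]
leak_tempered = temper_then_tilt(p, f_hat, T=2.0)[0]
print("forget mass at T=1:", leak_untempered)
print("forget mass at T=2:", leak_tempered)
```

In this toy setting the same classifier error leaves less probability on the forget outcome after tempering, because the spike in $p$ is flattened before the tilt is applied. Tempering also distorts the retain distribution in general, which corresponds to the bias trade-off described in the theoretical analysis.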

Paper Structure (34 sections, 13 theorems, 139 equations, 5 figures, 5 tables)

Introduction
Related Work
Proposed Method: T3-Unlearning
Probabilistic Formulation
T3-Unlearning Procedure
LLM Implementation
Theoretical Guarantees
Unlearning Metrics
Surrogate Analysis Framework
Unlearning Guarantees: Untempered Estimator
Unlearning Guarantees: The Tempered Estimator
Experiments
Unlearning Metrics
Empirical Results
Unlearning Efficiency
...and 19 more sections

Key Result

Theorem 3.3

Let $\hat{f} \in \mathcal{F}$ satisfy the excess risk bound $L(\hat{f}) - L(f^\ast) \leq \delta$. Then the Retain Error of the untempered estimate $\hat{p}_r^{(1)}$ (the $T=1$ case of the T3 estimator) satisfies

Figures (5)

Figure 1: T3-Unlearning for LLMs. We freeze the base model and train a linear head (shaded) on pooled hidden states to predict the vector $\bm{g}_{\bm{\phi}}(\bm{x})$ of class posteriors for all possible next tokens. In training, we apply the loss to the entry $[\bm{g}_{\bm{\phi}}(\bm{x})]_y$, which estimates the class posterior $\mathbb{P}(s=1 \mid \bm{x}, y)$, while at inference the entire vector $\bm{g}_{\bm{\phi}}(\bm{x})$ tilts the base model's tempered logits.
Figure 2: Retain and Forget Errors as a function of forget set component variance $v_f$ and base model temperature $T$.
Figure 3: Retain and Forget Errors as a function of sample size $n$ and temperature $T$.
Figure 4: Example data and learned classifier for Experiment 2 with $n=25$ samples. The y-axis tracks both the sample counts for the generated samples (left) and the probabilities assigned by the classifier (right).
Figure 5: Estimated retain set densities $\hat{p}_r^{(T)}$ in the setting of Experiment 2 for $n=25$ samples and temperature $T \in \{1.0, 1.5, 2.0\}$.
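The inference step described in the Figure 1 caption can be sketched at the level of a single next-token distribution. This is a minimal illustration under stated assumptions, not the paper's code: the sigmoid link in `head_posteriors`, the shapes of `W` and `b`, and the reading of each entry of $\bm{g}_{\bm{\phi}}(\bm{x})$ as a retain-class posterior are all assumptions, and both function names are hypothetical.

```python
import numpy as np

def head_posteriors(h, W, b):
    """Hypothetical linear head: pooled hidden state -> per-token posteriors.

    Assumes a sigmoid link; h has shape (d,), W has shape (V, d), b has shape (V,).
    """
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))

def t3_next_token_probs(base_logits, posteriors, T=1.0, eps=1e-12):
    """Temper base logits by T, then tilt by log-posteriors and renormalize.

    softmax(logits / T) is proportional to p**(1/T), so adding log g gives
    a distribution proportional to p**(1/T) * g.
    """
    tempered = np.asarray(base_logits, dtype=float) / T
    tilt = np.log(np.clip(posteriors, eps, 1.0))
    scores = tempered + tilt
    scores = scores - scores.max()  # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Usage with random stand-ins for the frozen model's outputs.
rng = np.random.default_rng(0)
h = rng.normal(size=4)            # pooled hidden state (d = 4)
W = rng.normal(size=(6, 4))       # head weights for a vocabulary of 6 tokens
b = np.zeros(6)
logits = rng.normal(size=6)       # base model next-token logits

g = head_posteriors(h, W, b)
probs = t3_next_token_probs(logits, g, T=2.0)
```

Working in logit space avoids forming $p^{1/T}$ explicitly, and only the small head (`W`, `b`) would need training while the base model stays frozen, consistent with the caption.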

Theorems & Definitions (25)

Remark 3.1
Remark 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Remark 3.6: Limitations of Classifier Sharpening
Theorem 3.7
Theorem 3.8
Theorem
Proof of Theorem (th:risk-to-retain-err-untempered)
...and 15 more
