Temper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance
Jacob L. Block, Mehryar Mohri, Aryan Mokhtari, Sanjay Shakkottai
TL;DR
This work reframes unlearning in generative models as distribution correction via density-ratio estimation, modeling the data as a mixture $p(z)=(1-\gamma)p_r(z)+\gamma p_f(z)$ of a retain distribution $p_r$ and a forget distribution $p_f$. It shows that classical classifier-guided tilting can leak forget-set information when $p_f$ is sharply concentrated, and introduces Temper-Then-Tilt Unlearning (T3-Unlearning), which first tempers the base distribution to $p(z)^{1/T}$ and then tilts it with a learned classifier $\hat f$, yielding $\hat p^{(T)}_r(z) \propto p(z)^{1/T} \hat f(z)$. The paper proves finite-sample guarantees linking the classifier's excess risk $\delta$ to Retain and Forget Errors, showing that tempering reduces the Forget Error's scaling from linear in $\|p_f\|_\infty$ to $\|p_f\|_\infty^{1/T}$ plus a tempering bias, at the cost of a tempering bias in the Retain Error. Empirically, on the TOFU benchmark with Llama 3.1 8B, T3-Unlearning outperforms baselines in forget quality while maintaining high generative utility, and it does so with a tiny fraction of the trainable parameters and faster runtimes. Overall, the approach provides principled, efficient unlearning with provable guarantees and strong practical performance for large language models.
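A short derivation may help make the tilting step concrete. It assumes, as one natural reading of the mixture setup, that $f^\star(z)$ is the Bayes-optimal probability that a sample $z$ drawn from $p$ came from the retain component; the symbol $f^\star$ is introduced here for illustration only.

$$f^\star(z) = \frac{(1-\gamma)\,p_r(z)}{p(z)} \quad\Longrightarrow\quad p(z)\,f^\star(z) = (1-\gamma)\,p_r(z) \propto p_r(z).$$

Under this assumption, tilting by the exact classifier recovers $p_r$, and setting $T=1$ with a learned $\hat f$ gives classical classifier guidance, $\hat p^{(1)}_r(z) \propto p(z)\,\hat f(z)$. Any error in $\hat f(z)$ is multiplied by $p(z)$, which is large wherever $p_f$ is sharply concentrated; replacing $p(z)$ with $p(z)^{1/T}$ for $T>1$ shrinks that multiplier.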
Abstract
We study machine unlearning in large generative models by framing the task as density ratio estimation with respect to a target distribution rather than as supervised fine-tuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i) tempering the base distribution to flatten high-confidence spikes, and (ii) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier's risk to unlearning error, proving that tempering is necessary to successfully unlearn for concentrated distributions. Empirical evaluations on the TOFU benchmark show that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a small fraction of the parameters and requiring minimal runtime.
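The two-step procedure can be sketched on a toy discrete distribution. This is a minimal illustration, not the paper's implementation: the function `temper_then_tilt`, the 100-outcome support, the mixture weight `gamma = 0.1`, and the classifier error of 0.2 on the forget outcome are all hypothetical choices made here, and the classifier is assumed to output the probability that an outcome belongs to the retain set.

```python
import numpy as np

def temper_then_tilt(p, f_hat, T=1.0):
    """Return the normalized estimate proportional to p(z)**(1/T) * f_hat(z)."""
    p = np.asarray(p, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    unnorm = p ** (1.0 / T) * f_hat  # temper the base, then tilt by the classifier
    return unnorm / unnorm.sum()

# Hypothetical toy setup: 99 retain outcomes plus one concentrated forget outcome.
gamma = 0.1
w = np.linspace(1.0, 2.0, 99)
p_r = np.append(w / w.sum(), 0.0)      # retain distribution (zero on the forget outcome)
p_f = np.append(np.zeros(99), 1.0)     # forget distribution: a single sharp spike
p = (1 - gamma) * p_r + gamma * p_f    # base mixture

# Assumed Bayes-optimal retain probability, and an imperfect classifier
# that wrongly assigns probability 0.2 to the forget outcome.
f_star = (1 - gamma) * p_r / p
f_hat = f_star.copy()
f_hat[-1] = 0.2

exact = temper_then_tilt(p, f_star, T=1.0)      # exact classifier, no tempering
tilt_only = temper_then_tilt(p, f_hat, T=1.0)   # classical classifier guidance
tempered = temper_then_tilt(p, f_hat, T=2.0)    # temper, then tilt

def retain_tv(q):
    """Total variation between q conditioned on the retain outcomes and p_r."""
    cond = q[:-1] / q[:-1].sum()
    return 0.5 * np.abs(cond - p_r[:-1]).sum()

print("forget mass, T=1:", tilt_only[-1])
print("forget mass, T=2:", tempered[-1])
print("retain distortion, T=1:", retain_tv(tilt_only))
print("retain distortion, T=2:", retain_tv(tempered))
```

In this toy case the tempered estimate places less mass on the forget outcome than tilting alone, while flattening the retain outcomes away from $p_r$, which mirrors the trade-off between Forget Error and tempering bias described above.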
