Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

TL;DR

This work proposes Partial Model Collapse (PMC), a novel unlearning method that overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets. PMC removes private information from model outputs more effectively while preserving general model utility.

Abstract

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

Paper Structure

This paper contains 33 sections, 1 theorem, 26 equations, 35 figures, 14 tables, 1 algorithm.

Key Result

Proposition C.3

Iteratively relearning a categorical distribution $\pi_t$ on its own generated data yields model collapse, independent of the initial distribution.
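
The statement can be illustrated with a small simulation. The sketch below is not the authors' code; it assumes that "relearning" means refitting the categorical distribution by maximum likelihood (empirical frequencies) on a fixed number of samples drawn from the current model. The function names, the sample size of 50, and the starting distributions are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def relearn_step(probs, num_samples, rng):
    """Sample from the current model, then refit it by MLE (empirical frequencies)."""
    categories = list(range(len(probs)))
    samples = rng.choices(categories, weights=probs, k=num_samples)
    counts = Counter(samples)
    return [counts[c] / num_samples for c in categories]


def iterate_until_collapse(probs, num_samples=50, max_rounds=100_000, seed=0):
    """Repeat relearning until all probability mass sits on one category.

    Returns the final distribution and the number of rounds performed.
    """
    rng = random.Random(seed)
    for t in range(max_rounds):
        if max(probs) == 1.0:  # collapsed: a single category has all the mass
            return probs, t
        probs = relearn_step(probs, num_samples, rng)
    return probs, max_rounds


if __name__ == "__main__":
    # Two different starting points (assumed for illustration): both collapse.
    for init in ([0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]):
        final, rounds = iterate_until_collapse(init, seed=1)
        print(f"init={init} -> final={final} after {rounds} rounds")
```

Once a category receives zero samples in some round, its estimated probability becomes zero and it can never be sampled again, which is why the process loses information irreversibly. Which category survives depends on the random seed and the starting distribution; that some category ends up with all the mass does not.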

Figures (35)

  • Figure 1: We propose PMC, a novel unlearning method that leverages the principles of model collapse to remove information from LLMs. By iteratively fine-tuning LLMs on their own generated responses, we trigger distribution collapse conditionally for sensitive questions, effectively removing information from model outputs. Unlike (1) fine-tuning on fixed refusals such as "I don't know", or (2) using gradient ascent to optimize against fixed ground-truth sequences, PMC fine-tunes on responses the model is already likely to generate. This allows us to achieve more effective and robust unlearning without requiring fixed ground-truth sequences in the fine-tuning data.
  • Figure 2: Unlearning through iterative MLE-relearning for categorical distributions. The model's knowledge about all other categories vanishes over time until it models target categories (bold) only.
  • Figure 3: Partial model collapse (PMC) significantly dominates baselines and expands the Pareto front w.r.t. utility and unlearn quality for (a) Phi-1.5, (b) Llama-3.2-3B-Instruct and (c) Gemma-3-12b-it. While existing methods (GA, GD, DPO, NPO, SimNPO, and IDK) also unlearn, they cannot deviate much from the fine-tuned model without compromising the model's general capabilities. Orange vertical lines indicate the utility of fine-tuned models before unlearning. Stars represent dominating points. For improved accessibility, we provide this plot with symbols instead of colors in the appendix with additional results.
  • Figure 4: PMC is more robust against sampling and prefilling attacks. Lower average worst-case leakage is better.
  • Figure 5: Limitations of unlearning methods optimizing on unlearning targets: (a) Side effects on unrelated datasets. (b) Accuracy when selecting least likely answer across quantiles (black line is random guessing). (c) Distribution of minimum probabilities across all multiple-choice options.
  • ...and 30 more figures

Theorems & Definitions (7)

  • Definition C.1: Categorical distribution
  • Definition C.2: Model collapse
  • Proposition C.3
  • proof : Full proof.
  • proof
  • proof
  • proof
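
The following is a sketch of one standard argument for why collapse occurs in the setting of Proposition C.3. It is not necessarily the authors' proof. It assumes that each round refits $\pi_{t+1}$ by maximum likelihood on $n$ i.i.d. samples drawn from $\pi_t$; the sample size $n$ and the component notation $\pi_{t,i}$ are introduced here for illustration.

$$
n\,\pi_{t+1} \mid \pi_t \sim \mathrm{Multinomial}(n, \pi_t),
\qquad
\mathbb{E}\big[\pi_{t+1,i} \mid \pi_t\big] = \pi_{t,i},
\qquad
\mathrm{Var}\big[\pi_{t+1,i} \mid \pi_t\big] = \frac{\pi_{t,i}\,(1-\pi_{t,i})}{n}
$$

$$
\mathbb{E}\big[\pi_{t+1,i}\,(1-\pi_{t+1,i}) \mid \pi_t\big]
= \pi_{t,i} - \Big(\frac{\pi_{t,i}(1-\pi_{t,i})}{n} + \pi_{t,i}^2\Big)
= \Big(1-\frac{1}{n}\Big)\,\pi_{t,i}\,(1-\pi_{t,i})
$$

$$
\mathbb{E}\big[\pi_{t,i}\,(1-\pi_{t,i})\big]
= \Big(1-\frac{1}{n}\Big)^{t}\,\pi_{0,i}\,(1-\pi_{0,i})
\;\longrightarrow\; 0
\quad \text{as } t \to \infty
$$

Since $\pi_{t,i}(1-\pi_{t,i})$ is non-negative and its expectation vanishes for every category $i$, each $\pi_{t,i}$ concentrates on $\{0, 1\}$. The decay rate depends only on $n$, and the limit holds for any starting point $\pi_0$, which matches the proposition's claim of independence from the initial distribution.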