Revisiting the Past: Data Unlearning with Model State History

Keivan Rezaei, Mehrdad Saberi, Abhilasha Ravichander, Soheil Feizi

TL;DR

This paper tackles data-level unlearning in large language models by introducing Model State Arithmetic (MSA), a post-hoc method that leverages intermediate checkpoints to compute a forget vector and merge it into the target model. By formulating unlearning as arithmetic in parameter space, MSA aims to reproduce the behavior of a model trained without the forget data, while preserving non-forget data performance. Across TOFU, RESTOR, and MUSE-Books benchmarks, MSA consistently matches or surpasses prior unlearning methods, demonstrating strong forgetting, recovery, and utility preservation. The approach is practical because it reuses existing model state history and remains effective even when the checkpoint predates the forget targets by a substantial amount, enabling data erasure in real-world, regulated settings such as the EU Right to Be Forgotten.

Abstract

Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints on a model through complete retraining -- by repeatedly pretraining the model on datasets that exclude these specific instances -- is computationally prohibitive. To address this, unlearning algorithms have been proposed that aim to eliminate the influence of particular datapoints at a low computational cost while leaving the rest of the model intact. However, precisely unlearning the influence of data on a large language model has proven to be a major challenge. In this work, we propose a new algorithm, MSA (Model State Arithmetic), for unlearning datapoints in large language models. MSA utilizes prior model checkpoints -- artifacts that record model states at different stages of pretraining -- to estimate and counteract the effect of targeted datapoints. Our experimental results show that MSA achieves competitive performance and often outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.
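
The summary describes MSA only at a high level: a forget vector is derived from an earlier checkpoint and merged into the target model. The sketch below illustrates that kind of parameter-space arithmetic on toy data. It rests on assumptions that are not stated in the text above: the forget vector is taken to be the parameter difference between the checkpoint after training on the forget data and the checkpoint itself, the merge is taken to be a scaled subtraction, and the names `forget_vector`, `merge_forget_vector`, and `alpha` are hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List

# Toy stand-in for a model state: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def forget_vector(checkpoint: Params, checkpoint_after_forget: Params) -> Params:
    """Assumed definition: the parameter change caused by training the
    earlier checkpoint on the forget data."""
    return {
        name: [after - before
               for before, after in zip(checkpoint[name], checkpoint_after_forget[name])]
        for name in checkpoint
    }


def merge_forget_vector(target: Params, vec: Params, alpha: float = 1.0) -> Params:
    """Assumed merge rule: subtract the forget vector, scaled by the
    hypothetical coefficient alpha, from the target (final) model."""
    return {
        name: [w - alpha * d for w, d in zip(target[name], vec[name])]
        for name in target
    }


# Toy numbers only; real model states would be tensors with matching shapes.
checkpoint = {"layer.weight": [0.5, -1.0, 2.0]}               # earlier checkpoint
checkpoint_after_forget = {"layer.weight": [0.75, -1.0, 1.5]}  # checkpoint trained on forget data
target = {"layer.weight": [1.0, 0.0, 1.0]}                    # final model to be unlearned

vec = forget_vector(checkpoint, checkpoint_after_forget)
unlearned = merge_forget_vector(target, vec, alpha=1.0)
print(vec)        # {'layer.weight': [0.25, 0.0, -0.5]}
print(unlearned)  # {'layer.weight': [0.75, 0.0, 1.5]}
```

With `alpha=0.0` the target model is returned unchanged, so under these assumptions the coefficient acts as a knob trading forgetting strength against preservation of the remaining behavior.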

Paper Structure

This paper contains 45 sections, 6 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Our proposed framework, MSA. Training proceeds over several steps, beginning from an initial model. By the time the final model $\theta_\mathcal{D}$ is obtained, the unlearning documents $\mathcal{D}_\text{f}$ have been unintentionally introduced during training. At an intermediate checkpoint $C$, prior to the introduction of the unlearning targets, we extract a forget vector $\vec{\theta}_\text{f}$ that captures how $\mathcal{D}_\text{f}$ influences the model. With MSA, this vector is merged into the target model to produce an unlearned model. Unlike existing unlearning methods that operate solely on the final model checkpoint, MSA leverages earlier training dynamics to remove the influence of $\mathcal{D}_\text{f}$ more effectively, forgetting the targeted datapoints while restoring the ideal model performance.
  • Figure 2: Examples from TOFU’s forget set, showing the ground truth, the ideal output, and the output of MSA (using the Llama-3.1-8B-Instruct model). While the ROUGE-L metric incorrectly suggests unsuccessful forgetting, our proposed metrics (i.e., $\text{Acc}_\text{forget}$ and $\text{Acc}_\text{recover}$) show that forgetting succeeded and that the ideal output was recovered.
  • Figure 3: Examples from TOFU’s retain set, showing the ground truth, the ideal output, and the output of MSA (using the Llama-3.1-8B-Instruct model). While the ROUGE-L metric incorrectly suggests unsuccessful retention, the generated outputs are semantically faithful and correctly answer the prompts. Our proposed metric $\text{Acc}_\text{retain}$ captures this alignment more accurately.
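
The captions for Figures 2 and 3 note that ROUGE-L can misjudge outputs that are semantically correct. The sketch below shows why that can happen: ROUGE-L scores the longest common subsequence of tokens, so a correct paraphrase with little word overlap scores low. It assumes simple whitespace tokenization and the F1 variant of the score. The example sentences are hypothetical and are not taken from TOFU, and the function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over lowercase, whitespace-split tokens (an assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the paraphrase conveys the same fact as the reference.
reference = "the author was born in paris"
paraphrase = "paris is her birthplace"

score_same = rouge_l_f1(reference, reference)
score_paraphrase = rouge_l_f1(reference, paraphrase)
print(score_same, score_paraphrase)
```

Here the paraphrase shares only the token "paris" with the reference, so its score is far below that of an exact match even though the answer is correct. An accuracy-style metric that checks the answer itself would not penalize it in this way.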