Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

TL;DR

This work questions the assumption that exact unlearning fully mitigates privacy leakage in LLMs by showing that an adversary with access to both pre- and post-unlearning checkpoints or logits can mount data-extraction attacks. The authors introduce reversed model guidance and a token-filtering scheme that exploit the differences between the two model states, significantly improving extraction of the forgotten data on benchmarks such as MUSE, TOFU, and WMDP, as well as on a simulated medical diagnosis dataset. In some cases the approach doubles extraction success rates relative to the baseline, pointing to practical privacy risks in real-world deployments. The results motivate evaluating unlearning methods under broader threat models that include adversarial access to prior checkpoints.

Abstract

Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits APIs are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. Our findings suggest that unlearning may, paradoxically, increase the risk of privacy leakage during real-world deployments. We therefore advocate that evaluations of unlearning methods consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.

Paper Structure

This paper contains 27 sections, 4 equations, 26 figures, 6 tables.

Figures (26)

  • Figure 1: An example from our experiments illustrating how real-world patient information can be extracted using some side information. When the pre-unlearning checkpoint is accessible, our method, which leverages both the pre- and post-unlearning checkpoints, extracts significantly more information than the baseline, which uses only the pre-unlearning checkpoint. Red highlights indicate correctly extracted content.
  • Figure 2: Visualization of reversed model guidance. We combine predictions from the pre- and post-unlearning models to approximate the forgotten distribution $q(x_{i+1}|x_{\leq i})$, resulting in a more effective extraction attack.
  • Figure 3: Comparison of our extraction method and the baseline on MUSE using Phi-1.5, evaluated at 3 epochs across different forgetting set ratios.
  • Figure 4: Comparison of our extraction method and the baseline on MUSE using Phi-1.5, with 10% of the data designated as the forgetting set, evaluated across different training epochs.
  • Figure 5: Extraction performance under different guidance scales $w$ on MUSE using Phi-1.5, evaluated with a 10% forgetting set size.
  • ...and 21 more figures
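
The captions for Figures 2 and 5 describe combining pre- and post-unlearning predictions under a guidance scale $w$ to approximate the forgotten distribution $q(x_{i+1}|x_{\leq i})$, and the abstract pairs this with token filtering. The exact formula and filtering rule are not given in this excerpt, so the sketch below rests on assumptions: it scores each token as log p_pre + w (log p_pre - log p_post), and it filters by discarding tokens whose pre-unlearning probability falls below a threshold. The function names, the `min_pre_prob` parameter, and the toy logits are all hypothetical and are not taken from the authors' code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_scores(pre_logits, post_logits, w=1.0, min_pre_prob=0.01):
    """Assumed form of reversed guidance with token filtering.

    Score = log p_pre + w * (log p_pre - log p_post): tokens the
    pre-unlearning model favors more than the post-unlearning model
    are pushed up. Tokens with p_pre < min_pre_prob are filtered out
    (score -inf) so that a large ratio on a rare token cannot win.
    """
    pre = log_softmax(pre_logits)
    post = log_softmax(post_logits)
    scores = []
    for lp_pre, lp_post in zip(pre, post):
        if math.exp(lp_pre) < min_pre_prob:
            scores.append(float("-inf"))  # token filtered out
        else:
            scores.append(lp_pre + w * (lp_pre - lp_post))
    return scores


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 4-token vocabulary (made-up logits for illustration only).
pre_logits = [2.0, 1.9, 0.0, -6.0]    # pre-unlearning model
post_logits = [2.0, 0.5, 0.0, -12.0]  # post-unlearning model

baseline_choice = greedy_pick(log_softmax(pre_logits))
guided_choice = greedy_pick(guided_scores(pre_logits, post_logits, w=2.0))
unfiltered_choice = greedy_pick(
    guided_scores(pre_logits, post_logits, w=2.0, min_pre_prob=0.0)
)
print(baseline_choice, guided_choice, unfiltered_choice)
```

In this toy setup the pre-unlearning model alone picks token 0, while the guided score picks token 1, whose probability dropped most after unlearning. With filtering disabled, token 3 wins purely because of its large log-ratio despite being almost never predicted by either model. That is the kind of noise a filtering step is meant to suppress under these assumptions.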