Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu

TL;DR

Unlearning leaves measurable signatures in LLMs. This introduces a new risk: once a model is identified as unlearned from its response to an input query, the forgotten information may be reverse-engineered.

Abstract

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.
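
The claim that "even a simple supervised classifier" can separate original from unlearned models can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the features are synthetic Gaussian vectors standing in for logits or activations, the class separation (`shift`) is arbitrary, and the logistic-regression probe is not the authors' classifier. The helper names `train_logistic_probe`, `predict`, and `make_synthetic_features` are hypothetical.

```python
import numpy as np


def train_logistic_probe(features, labels, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        z = features @ w + b
        p = 1.0 / (1.0 + np.exp(-z))  # predicted P(model is unlearned)
        grad = p - labels             # gradient of the log-loss w.r.t. z
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict(features, w, b):
    """Return 1 for 'unlearned', 0 for 'original'."""
    return (features @ w + b > 0).astype(int)


def make_synthetic_features(n_per_class=200, dim=16, shift=3.0, seed=0):
    """Synthetic stand-in for model outputs (assumption, not real data).

    The two classes differ by a mean shift along a single direction,
    placed symmetrically around the origin.
    """
    rng = np.random.default_rng(seed)
    direction = np.zeros(dim)
    direction[0] = 1.0
    original = rng.normal(size=(n_per_class, dim)) - 0.5 * shift * direction
    unlearned = rng.normal(size=(n_per_class, dim)) + 0.5 * shift * direction
    X = np.vstack([original, unlearned])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    perm = rng.permutation(len(y))
    return X[perm], y[perm]


X, y = make_synthetic_features()
split = len(y) // 2  # first half for training, second half held out
w, b = train_logistic_probe(X[:split], y[:split])
accuracy = (predict(X[split:], w, b) == y[split:]).mean()
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here reflects only the synthetic separation chosen above and says nothing about the detection rates reported in the paper.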

Paper Structure

This paper contains 36 sections, 8 equations, 13 figures, 20 tables.

Figures (13)

  • Figure 1: Schematic overview of unlearning trace detection. Original and unlearned models are queried with forget-relevant and forget-irrelevant prompts, and observable model outputs are collected, either discrete textual responses or continuous pre-logit activations. A lightweight classifier is then trained to predict whether the model has undergone unlearning. Detectable behavioral shifts in both outputs and internal representations indicate the presence of unlearning traces.
  • Figure 2: GPT-2 perplexity distributions for Yi-34B vs. RMU-unlearned responses. (a) WMDP forget queries (3,000 samples). (b) MMLU forget-irrelevant queries (3,000 samples). Perplexity quantifies the fluency and predictability of the responses.
  • Figure 3: Radar charts of unlearning trace detection accuracy across four source LLMs evaluated on three test sets (WMDP, MMLU, UltraChat). Panel (a) reports results for models unlearned on WMDP using RMU, while panel (b) reports results for models unlearned on WMDP using NPO. Each axis corresponds to a detection setting defined by a source model $A$ and a test dataset $B$, where the classifier is trained on outputs from model $A$ and evaluated on prompts from dataset $B$. Results are shown for two output types: text-based responses (blue) and pre-logit activations (orange). Detailed numerical results are provided in the appendix.
  • Figure 4: Projection of activations from various layers for 3,000 responses to MMLU onto the first right singular vector (denoted as SV1) for both original and unlearned models. Here, $\mathrm{L}_{i}$.d_proj refers to activations extracted from the down-projection sublayer of the FFN in the $i$-th transformer block, while final denotes the activations of the final layer after RMS-norm. Panels (a,d) show Zephyr-7B, (b,e) LLaMA3.1-8B, and (c,f) Yi-34B.
  • Figure 5: Projection of the final-layer normalized activations from 3,000 MMLU responses onto the first right singular vector (SV1) for the original LLaMA3.1-8B model and its NPO-unlearned counterpart.
  • ...and 8 more figures
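
The projections described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the activations are synthetic, the two models' activations are pooled and mean-centered before the SVD (the captions do not say whether the paper does this), and the function name `project_onto_sv1` is hypothetical.

```python
import numpy as np


def project_onto_sv1(original_acts, unlearned_acts):
    """Project both activation sets onto the first right singular vector.

    Assumption: SV1 is computed from the pooled, mean-centered
    activations of both models.
    """
    pooled = np.vstack([original_acts, unlearned_acts])
    mean = pooled.mean(axis=0, keepdims=True)
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    sv1 = vt[0]
    return (original_acts - mean) @ sv1, (unlearned_acts - mean) @ sv1


# Synthetic stand-in for final-layer activations (assumption, not real data):
# the "unlearned" set is offset along one hidden dimension.
rng = np.random.default_rng(0)
dim = 32
offset = np.zeros(dim)
offset[3] = 6.0
original_acts = rng.normal(size=(300, dim))
unlearned_acts = rng.normal(size=(300, dim)) + offset

proj_orig, proj_unl = project_onto_sv1(original_acts, unlearned_acts)
gap = abs(proj_orig.mean() - proj_unl.mean())
print(f"gap between mean SV1 projections: {gap:.2f}")
```

The sign of a singular vector is arbitrary, so only the separation between the two projected distributions is meaningful, not which side each model falls on.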