Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong

TL;DR

This work shows that existing LLM unlearning approaches often appear effective when evaluated with greedy decoding but fail to forget under realistic probabilistic decoding. It introduces leak@$k$, a meta-metric that quantifies worst-case information leakage across $k$ generations using core metrics like ROUGE-L, Cosine Similarity, Entailment Score, Accuracy, and LLM-based judgments. Through large-scale evaluation on TOFU, MUSE, and WMDP across varying decoding settings, the authors demonstrate that leakage generally rises with $k$ and is strongly influenced by top-$p$, revealing brittleness in current unlearning methods. They propose a simple mitigation, NPO-Fix, which augments the forget set with detected leakage instances, but results indicate that stronger, more principled solutions are still needed to achieve reliable forgetting without sacrificing utility.

Abstract

Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these "unlearned" models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@$k$ metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning.
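To make the idea of worst-case leakage across $k$ generations concrete, here is a minimal sketch of one natural estimator. It is an assumption for illustration, not necessarily the estimator used in the paper: given $n$ generations that have already been scored with a core metric (for example ROUGE-L recall, higher meaning more leakage), it computes the expected maximum score over a uniformly random subset of $k$ of them. The function name `leak_at_k` and the example scores are hypothetical.

```python
from math import comb


def leak_at_k(scores, k):
    """Expected maximum score over k generations drawn without replacement
    from the n scored generations in `scores` (assumed estimator)."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(scores)
    # The i-th smallest score (1-indexed) is the maximum of exactly
    # comb(i - 1, k - 1) of the comb(n, k) possible size-k subsets.
    weighted = sum(s * comb(i - 1, k - 1) for i, s in enumerate(ordered, start=1))
    return weighted / comb(n, k)


# Hypothetical core-metric scores for n = 3 sampled generations of one prompt.
example_scores = [0.1, 0.2, 0.9]
for k in (1, 2, 3):
    print(k, leak_at_k(example_scores, k))
```

With $k=1$ this reduces to the mean score, and with $k=n$ it returns the single worst (highest) score, so the estimate can only grow or stay flat as $k$ increases. That matches the qualitative trend described above, where a rare leaking generation is invisible at small $k$ but dominates at large $k$.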

Paper Structure

This paper contains 15 sections, 16 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: $\widehat{\text{leak@}k}$ measured using ROUGE-L score ($\widehat{\text{leak@}k}$--RS) for various unlearned models on the MUSE-News dataset using the LLaMA2-7B model at $T=0.2$ and $p=1.0$. When $k$ is small, the unlearned models show limited leakage of information from the forget set. However, as $k$ increases, all models reveal increasingly sensitive information about the forget set questions.
  • Figure 2: $\widehat{\text{leak@}k}$--ES heatmaps for unlearning methods on the TOFU benchmark with LLaMA-3.2-1B. Each cell reports ES across $k$ generations. Rows denote unlearning methods, columns denote values of $k$, and each plot corresponds to a different $(\text{temperature}, \text{top-}p)$ configuration. Leakage is almost stable at $(0.2,0.2)$ but increases with larger $p$ values, even when temperature remains low, whereas high $T$ with low $p$ does not produce explicit leakage.
  • Figure 3: $\widehat{\text{leak@}k}$--LJ heatmaps for unlearning methods on the TOFU benchmark with LLaMA-3.2-1B using two sampling configurations, $(T,p)=(0.2,0.2)$ and $(T,p)=(1.0,1.0)$. A slight rise appears at low randomness, while LJ confirms explicit information leakage under high-randomness decoding.
  • Figure 4: $\widehat{\text{leak@}k}$--RS heatmaps for various unlearning methods evaluated on the MUSE-News benchmark using the LLaMA2-7B model. Each heatmap cell represents the ROUGE-L recall achieved across $k$ generations. Rows correspond to different unlearning methods, and columns represent the number of generations $k$. Each plot uses a different sampling configuration (temperature, top-$p$).
  • Figure 5: $\widehat{\text{leak@}k}$--RS for the RMU model on the WMDP dataset using the Zephyr-7B-beta model. Rows correspond to different pairs of $(T,p)$.
  • ...and 7 more figures
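
The captions above report that leakage depends strongly on top-$p$, even at low temperature. The following sketch shows the generic mechanics of temperature scaling followed by top-$p$ (nucleus) truncation on a single next-token distribution. It is a textbook-style illustration and not the paper's evaluation code; the function name `nucleus_distribution` and the toy logits are hypothetical.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Return the renormalized token distribution after temperature
    scaling and top-p truncation (generic sketch)."""
    if temperature <= 0 or not 0 < top_p <= 1:
        raise ValueError("need temperature > 0 and 0 < top_p <= 1")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated


toy_logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
print(nucleus_distribution(toy_logits, temperature=2.0, top_p=0.2))
print(nucleus_distribution(toy_logits, temperature=0.2, top_p=1.0))
```

In this toy case, a high temperature with a small top-$p$ still keeps only the single most likely token, while a low temperature with $p=1.0$ leaves every token with nonzero probability. One plausible reading of the captions, offered here as an interpretation rather than a claim from the paper, is that a large $p$ keeps low-probability continuations reachable, so they can surface once enough generations are sampled.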