Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions

Junbeom Kim, Kyuyoung Kim, Jihoon Tack, Dongha Lim, Jinwoo Shin

TL;DR

The paper tackles the challenge of preventing sensitive knowledge leakage from large language models by shifting focus from input-based suppression to output revision. It proposes CURE, a retrieval-augmented unlearning framework that attaches a parameter-efficient corrector via LoRA to verify and rewrite leaked responses using retrieved exclusions. By retrieving the most relevant unlearning targets and employing a two-stage curriculum (leakage detection and suppression reinforcement), CURE achieves substantial leakage reduction while preserving utility, and demonstrates robustness under continual unlearning. Empirical results across TOFU, WMDP, and MMLU show CURE outperforms fine-tuning and guardrail baselines, with practical inference overhead and broad generalization across domains.

Abstract

Language models trained on web-scale corpora risk memorizing and exposing sensitive information, prompting the need for effective machine unlearning. Prior methods mainly focus on input queries to suppress sensitive outputs, yet this often fails to eliminate the underlying knowledge and limits scalability. To address this, we propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework that verifies model outputs for leakage and revises them into safe responses. Specifically, CURE employs a lightweight corrector that is applied to the original model to verify whether outputs contain target knowledge and to rewrite them if any leakage is detected. To efficiently handle large-scale unlearning requests, CURE retrieves unlearning targets that are relevant to the initial response and provides them as in-context references to the corrector for detection and conditional revision. By leveraging this retrieval augmentation, the corrector can adapt to new unlearning requests without additional training. Extensive evaluations demonstrate that CURE substantially reduces information leakage, even from indirect queries where prior works fall short, while maintaining response quality and general utility. Moreover, it demonstrates robustness under continual unlearning scenarios, making it practical for real-world applications.
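The verify-then-revise flow described in the abstract can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `toy_base_model` replaces the original LLM, `retrieve_exclusions` uses simple word overlap rather than whatever retriever the paper uses, and `toy_corrector` uses substring matching in place of the trained corrector. The names, the example exclusions, and the refusal string are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Set

# Hypothetical safe response; the paper's corrector rewrites text instead.
REFUSAL = "I'm sorry, but I can't share that information."


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_exclusions(query: str, draft: str, exclusions: List[str], k: int = 2) -> List[str]:
    """Return up to k exclusions ranked by Jaccard overlap with (query, draft)."""
    probe = tokens(query) | tokens(draft)

    def overlap(item: str) -> float:
        item_tokens = tokens(item)
        union = probe | item_tokens
        return len(probe & item_tokens) / len(union) if union else 0.0

    scored = [(overlap(e), e) for e in exclusions]
    scored = [pair for pair in scored if pair[0] > 0.0]  # drop unrelated targets
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]


def toy_corrector(query: str, draft: str, retrieved: List[str]) -> str:
    """Assumed detector: the draft leaks if it mentions a retrieved exclusion."""
    leaked = any(e.lower() in draft.lower() for e in retrieved)
    return REFUSAL if leaked else draft  # revise only when leakage is detected


def cure_style_respond(query: str, base_model: Callable[[str], str], exclusions: List[str]) -> str:
    draft = base_model(query)                                  # 1. draft response
    retrieved = retrieve_exclusions(query, draft, exclusions)  # 2. draft-based retrieval
    return toy_corrector(query, draft, retrieved)              # 3. conditional revision


# Made-up data for the demo; none of it comes from the paper.
EXCLUSIONS = ["Jane Roe", "Project Nightjar"]


def toy_base_model(query: str) -> str:
    canned = {
        "Who wrote the novel?": "The novel was written by Jane Roe.",
        "Name a writer from the book club.": "One member is Jane Roe.",
        "What is the capital of France?": "Paris is the capital of France.",
    }
    return canned.get(query, "I am not sure.")


if __name__ == "__main__":
    for q in ("Who wrote the novel?",
              "Name a writer from the book club.",
              "What is the capital of France?"):
        print(q, "->", cure_style_respond(q, toy_base_model, EXCLUSIONS))
```

In this sketch the second query never names the target, yet the draft does, so retrieval conditioned on the draft still surfaces the exclusion. Appending a new string to `EXCLUSIONS` changes behavior without touching the corrector, which mirrors (in a very simplified way) the abstract's claim that new unlearning requests need no additional training.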

Paper Structure

This paper contains 32 sections, 9 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Limitations of existing unlearning methods. Red text marks information to unlearn, and blue text indicates safe content. (a) When responding to explicitly unlearned questions, fine-tuning methods such as RMU li2024wmdp degrade Llama3.1-8B's ability to produce valid responses, and guardrail-based methods like ECO liu2024large also lose coherence. (b) Moreover, both methods fail to fully remove the target knowledge, which can be revealed through indirect questions.
  • Figure 2: Overview of CURE. Given a query $x$, the base model $\mathcal{M}_{\theta}$ first produces a draft response $y_{0}$ that may contain private or undesired knowledge. CURE operates in two stages: (1) Draft-based retrieval: The pair $(x, y_{0})$ is used to query an unlearning-target database $\mathcal{K}$, retrieving the most relevant exclusions $\mathcal{K}^{\mathtt{retr}}$. (2) Response correction: A parameter-efficiently tuned corrector $\phi$ is applied at inference time, conditioning on $(x, y_{0}, \mathcal{K}^{\mathtt{retr}})$, to detect leakage and rewrite the response, producing the final safe output $y^{*}$ while preserving $\mathcal{M}_{\theta}$'s general knowledge.
  • Figure 3: Performance comparison of unlearning methods on TOFU. The figures report (a) leakage rate under direct queries versus utility, (b) leakage rate under indirect queries versus utility, and (c) leakage rate over all queries versus response plausibility. For interpretability, we set the original model’s leakage rate, utility, and plausibility to 100%, and plot all other methods relative to these values. We present detailed results in Appendix \ref{app:results}.
  • Figure 4: Continual unlearning performance. The figures show changes in (a) model utility, (b) plausibility, and (c) leakage rate over 20 successive unlearning requests; the leakage rate is averaged across direct and indirect queries. All values are normalized to the original model (100%). We compare our method with NPO zhang2024negative and RMU li2024wmdp.
  • Figure 5: Example of leaked response from retain model on TOFU. The retain model, despite not explicitly learning from the sample, generates a response reflecting learned biases, causing knowledge leakage. In contrast, CURE explicitly revises the original response to prevent any leakage, highlighting the fundamental difference in the goals of CURE and the retain model.
  • ...and 6 more figures
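
The captions for Figures 3 and 4 state that every metric is plotted relative to the original model, which is fixed at 100%. A minimal sketch of that normalization follows; the function name `normalize_to_original` and all numbers are illustrative assumptions, not values from the paper.

```python
from typing import Dict


def normalize_to_original(method: Dict[str, float], original: Dict[str, float]) -> Dict[str, float]:
    """Express each metric as a percentage of the original model's value."""
    return {name: 100.0 * value / original[name] for name, value in method.items()}


# Made-up raw scores, used only to show the arithmetic.
ORIGINAL = {"leakage": 80.0, "utility": 50.0, "plausibility": 60.0}
UNLEARNED = {"leakage": 20.0, "utility": 45.0, "plausibility": 57.0}

if __name__ == "__main__":
    for name, pct in normalize_to_original(UNLEARNED, ORIGINAL).items():
        print(f"{name}: {pct:.1f}% of the original model")
```

Under this convention, a lower normalized leakage rate together with utility and plausibility close to 100% indicates that a method removed the target knowledge while preserving the original model's behavior.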