Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs

Sungmin Cha; Sungjun Cho; Dasol Hwang; Moontae Lee

TL;DR

This work tackles privacy- and copyright-related memorization in LLMs by introducing Low-rank Knowledge Unlearning (LoKU), a framework that combines a new unlearning objective, Inverted Hinge Loss (IHL), with Fisher-information–guided initialization of low-rank adapters (FILA). IHL provides stable, targeted suppression of unwanted tokens while maintaining fluency, addressing the instability and over-forgetting seen with Gradient Ascent. FILA concentrates updates on parameters most responsible for generating memorized content, enabling efficient LoRA-based unlearning. Empirical results on the Training Data Extraction Challenge and TOFU demonstrate that LoKU can achieve robust forgetting with minimal degradation to reasoning and generation, while maintaining parameter efficiency relative to full fine-tuning.

Abstract

Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. However, this poses risks of privacy and copyright violations, highlighting the need for efficient machine unlearning methods that remove sensitive data without retraining from scratch. While Gradient Ascent (GA) is commonly used to unlearn by reducing the likelihood of generating unwanted content, it leads to unstable optimization and catastrophic forgetting of retained knowledge. We find that combining GA with low-rank adaptation results in poor trade-offs between computational cost and generative performance. To address these challenges, we propose Low-rank Knowledge Unlearning (LoKU), a novel framework that enables robust and efficient unlearning for LLMs. First, we introduce Inverted Hinge Loss, which suppresses unwanted tokens while maintaining fluency by boosting the probability of the next most likely token. Second, we develop a data-adaptive initialization for LoRA adapters via low-rank approximation weighted with relative Fisher information, thereby focusing updates on parameters critical for removing targeted knowledge. Experiments on the Training Data Extraction Challenge dataset using GPT-Neo models as well as on the TOFU benchmark with Phi-1.5B and Llama2-7B models demonstrate that our approach effectively removes sensitive information while maintaining reasoning and generative capabilities with minimal impact. Our implementation is available at https://github.com/csm9493/efficient-llm-unlearning.
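The abstract describes the Inverted Hinge Loss only in words, so the following is a minimal sketch of one per-token form consistent with that description, not the authors' implementation. It assumes the loss is 1 + p(target) - max over other tokens of p(v), and the function names `softmax`, `inverted_hinge_loss`, and `gradient_ascent_loss` are hypothetical. The GA objective is included only to contrast its unbounded behavior with the bounded hinge form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverted_hinge_loss(logits, target):
    """Assumed per-token form: 1 + p(target) - max_{v != target} p(v).

    Minimizing this lowers the probability of the unwanted token and
    raises the probability of the most likely alternative token.
    Requires a vocabulary of at least two tokens.
    """
    probs = softmax(logits)
    p_target = probs[target]
    p_runner_up = max(p for i, p in enumerate(probs) if i != target)
    return 1.0 + p_target - p_runner_up


def gradient_ascent_loss(logits, target):
    """GA-style objective: minimize log p(target), unbounded below."""
    return math.log(softmax(logits)[target])


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # toy 3-token vocabulary
    print("IHL, unwanted token is top-1:", inverted_hinge_loss(logits, 0))
    print("IHL, unwanted token is last: ", inverted_hinge_loss(logits, 2))
    print("GA objective near p -> 0:    ", gradient_ascent_loss([-50.0, 0.0], 0))
```

Under this assumed form the loss stays strictly between 0 and 2, whereas the GA objective keeps decreasing as the target probability approaches zero, which is one way to read the stability claim above.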

Paper Structure (19 sections, 22 equations, 7 figures, 9 tables)

Introduction
Related Work
Proposed Method: Low-rank Knowledge Unlearning (LoKU)
Preliminaries
Preliminary Results
Inverted Hinge Loss: A Novel Loss Function for LLM Unlearning
FILA: a Novel LoRA Initialization for LLM Unlearning
Final Loss Function of LoKU
Experiments
Training Data Extraction Challenge
Analysis
Task of Fictitious Unlearning
Concluding Remarks
Derivative Analysis for the Inverted Hinge Loss Function
Evaluation Metrics
...and 4 more sections

Figures (7)

Figure 1: LLM unlearning aims to forget data points in $\mathcal{D}_f$ while maintaining knowledge of the retain set $\mathcal{D}_r$. Unlike GA, our IHL induces higher unlearning stability by reducing the likelihood of unwanted tokens in a controlled manner. To accelerate unlearning with IHL, FILA extracts the parameters important for generating $\mathcal{D}_f$ and places them into the LoRA weights a priori via weighted low-rank approximation. IHL and FILA form a powerful synergy towards robust and efficient LLM unlearning.
Figure 2: Compute cost for successful unlearning vs. post-unlearning downstream performance. We unlearn 32 randomly sampled sequences in the Training Data Extraction Challenge from GPT-Neo-125M. Each point represents a different forget set and LoRA rank (if used). Left: Accuracy averaged across 9 classification tasks (higher is better). Middle: F1 score averaged across 4 dialogue generation tasks (higher is better). Right: Perplexity on the validation set of the Pile dataset (lower is better). Dashed lines indicate the performance of the model prior to unlearning. Unlearning via gradient difference (GD) with vanilla LoRA leads to a significant loss in performance compared to full-parameter GD unlearning due to lack of plasticity. However, our proposed LoKU, using both the Inverted Hinge Loss and Fisher-weighted LoRA initialization, performs competitively with unlearning via full fine-tuning in all three aspects while enjoying the parameter efficiency of LoRA.
Figure 3: Results from unlearning examples in the TDEC dataset on the GPT-Neo LLM family. Each row represents the performance averaged across datasets within each set of LLM capability tests: Reasoning (higher is better), Dialogue (higher is better), and Perplexity (lower is better). The circles and crosses represent successful and unsuccessful attempts, respectively, of unlearning a particular forget set ${\mathcal{D}}_f$. Solid lines indicate the performance of different methods averaged only across successful unlearning trials. The dashed lines indicate the base model performance prior to unlearning. GD leads to significant loss in performance and also fails to unlearn in some cases even with large LoRA ranks. Replacing the NCE loss in GD with IHL boosts retention of reasoning and generation capabilities, but still fails to unlearn in multiple cases. Running GD with FILA notably increases the rate of unlearning success, but at significant cost in overall performance. Our LoKU using both IHL and FILA best minimizes post-unlearning performance degradation in all aspects.
Figure 4: Results from unlearning examples from the TDEC dataset using LoRA with rank 32 to adapt different sets of layers on GPT-Neo-125M. The marker shapes and colors are used as in Figure 3. Based on the rate of unlearning success, tuning FFN layers (e.g., FFN, QVFFN) is more receptive to targeted knowledge removal than tuning attention layers (e.g., QV, QKVO).
Figure 5: TOFU results using Phi-1.5B and Llama2-7B models. Each row corresponds to unlearning a different forget set (1%, 5%, or 10%), and each column uses a distinct LoRA rank between 4 and 32. The relative size of the markers represents the number of epochs. Ideally, the unlearning curves should start from the pretrained model ($\blacksquare$) and approach the reference model tuned on the retain set only ($\bigstar$) as unlearning progresses. Both GD and GD+FILA suffer from a significant loss of model utility due to using GA for unlearning. Replacing GA with IHL largely retains model utility, and our LoKU, which initializes LoRA adapters with FILA, significantly boosts the unlearning efficiency of IHL.
...and 2 more figures
