
KUDA: Knowledge Unlearning by Deviating Representation for Large Language Models

Ce Fang, Zhikun Zhang, Min Chen, Qing Liu, Lu Zhou, Zhe Liu, Yunjun Gao

TL;DR

KUDA tackles the risk of memorized sensitive, copyrighted, or harmful knowledge in large language models by introducing a representation-level unlearning method that targets knowledge storage in FFN layers. It combines causal tracing to identify unlearning layers, a knowledge-representation deviation loss to forget target knowledge, and a relaxation null-space projection to preserve retained knowledge, with a principled two-stage hyperparameter tuning strategy. The approach achieves strong forgetting with minimal utility degradation and generalizes across modern models, supported by mechanistic gradient analyses showing near-orthogonal forgetting and retention directions. This work advances intrinsic safety for LLMs by enabling precise, robust unlearning that goes beyond output-level filtering or generic parameter deletion.

Abstract

Large language models (LLMs) acquire extensive knowledge through pre-training on vast and diverse corpora. While this endows LLMs with strong capabilities in generation and reasoning, it also amplifies the risks posed by sensitive, copyrighted, or harmful content in the training data. LLM unlearning, which aims to remove specific knowledge encoded within a model, is a promising technique for reducing these risks. However, existing LLM unlearning methods often force LLMs to generate random or incoherent answers because they cannot precisely alter the encoded knowledge. To achieve effective unlearning at the knowledge level, we propose Knowledge Unlearning by Deviating representAtion (KUDA). We first use causal tracing to locate the specific layers that store the target knowledge. We then design a new unlearning objective that induces the model's representations to deviate from their original positions during knowledge removal, thereby disrupting the model's ability to associate with the target knowledge. To resolve the optimization conflict between forgetting and retention, we employ a relaxation null-space projection mechanism that mitigates disruption to the representation space of retained knowledge. Extensive experiments on two representative benchmarks, WMDP and MUSE, demonstrate that KUDA outperforms most existing baselines by effectively balancing knowledge removal and retention of model utility.
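To make the null-space idea in the abstract concrete, the following is a minimal sketch of a strict null-space projection in plain Python. It is not the authors' implementation: the abstract describes a relaxation of this projection whose details are not given here, so this sketch shows only the unrelaxed case. The function names (`orthonormal_basis`, `project_rows_to_null_space`, `matvec`) and the toy numbers are assumptions made for illustration. The underlying fact is standard linear algebra: if every row of a weight update is orthogonal to the retained input vectors, the update leaves the layer's outputs on those inputs unchanged.

```python
def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_rows_to_null_space(grad, retain_inputs):
    """Remove from each row of `grad` its component in span(retain_inputs).

    `grad` is a (d_out x d_in) update for a linear map; `retain_inputs`
    holds d_in-dimensional inputs whose outputs should not change.
    """
    basis = orthonormal_basis(retain_inputs)
    projected = []
    for row in grad:
        r = list(row)
        for b in basis:
            coeff = sum(ri * bi for ri, bi in zip(r, b))
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


# Hypothetical toy data: two retained inputs and a 2 x 3 raw gradient.
retain_inputs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
grad = [[0.5, -1.0, 2.0], [1.0, 3.0, -1.0]]
projected_grad = project_rows_to_null_space(grad, retain_inputs)

# The projected update maps every retained input to (numerically) zero.
for k in retain_inputs:
    print(matvec(projected_grad, k))
```

The projected update still has nonzero entries, so it can move representations of inputs outside the retained span while leaving the retained ones fixed. A relaxed variant would presumably allow some controlled change along retained directions, but that is an assumption, not something stated in the abstract.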

Paper Structure (43 sections, 21 equations, 12 figures, 6 tables, 1 algorithm)


Figures (12)

  • Figure 1: FFN architectures in Transformer-based models.
  • Figure 2: Overview of the KUDA pipeline, which balances knowledge removal and utility retention for LLMs. KUDA comprises three major stages: 1) Causal tracing and causal effects identify the FFNs critical for knowledge storage, and a sliding window selects the target unlearning layers for parameter updates. 2) The representations generated by the last unlearning layer are captured and used to derive the update gradients $\nabla \mathcal{L}_u$ through the knowledge representation deviation mechanism. 3) The gradients are projected into the relaxation null-space of the retained knowledge, and the projected gradient $\nabla' \mathcal{L}_u$ is then applied to the last linear transformation matrix of the FFN in each unlearning layer. $R_{forget}$ and $R_{retain}$ are the representations of the knowledge to forget and to retain, respectively. $NS_{retain}$ denotes the relaxation null-space of the retained knowledge.
  • Figure 3: t-SNE visualization of representations on Wikitext and MUSE-News.
  • Figure 4: Evolution of the angle between the two gradients during unlearning with RMU and KUDA. We trained on WMDP for 600 steps; the plot shows how the gradient angle fluctuates over the entire process.
  • Figure 5: Unlearning performance heatmap across different hyperparameter combinations of $\beta$ (vertical axis) and $\tau$ (horizontal axis). The left plot presents forgetting quality with KRD, and the right one shows retaining utility via MMLU score. Darker colors indicate better performance.
  • ...and 7 more figures
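
The gradient-angle measurement mentioned in the Figure 4 caption can be sketched as follows. This is an illustrative helper, not code from the paper: it assumes the two gradients are the forgetting and retention gradients, each flattened into a plain list of floats, and the function name `gradient_angle_degrees` is hypothetical. An angle near 90 degrees means the two update directions are nearly orthogonal, while an angle above 90 degrees means that a step along one gradient partly undoes the other.

```python
import math


def gradient_angle_degrees(g_forget, g_retain):
    """Return the angle in degrees between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_forget, g_retain))
    norm_f = math.sqrt(sum(a * a for a in g_forget))
    norm_r = math.sqrt(sum(b * b for b in g_retain))
    if norm_f == 0.0 or norm_r == 0.0:
        raise ValueError("angle is undefined for a zero gradient")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_f * norm_r)))
    return math.degrees(math.acos(cosine))


# Hypothetical toy gradients: orthogonal, then directly opposed.
print(gradient_angle_degrees([1.0, 0.0], [0.0, 1.0]))
print(gradient_angle_degrees([1.0, 0.0], [-1.0, 0.0]))
```

Logging this value once per training step would produce a curve of the kind the caption describes.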