Explainable LLM Unlearning Through Reasoning

Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen, Zhen Fang

TL;DR

This study introduces a novel unlearning target, the reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response, and proposes targeted reasoning unlearning (TRU), which uses this target as guidance.

Abstract

LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit approach: it removes undesirable knowledge characterized by specific unlearning datasets. In previous work, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature leads to unintended degradation of general capabilities, incomplete removal of knowledge, and incoherent responses, among other problems. We argue that these issues stem from the absence of explicit guidance on what models should unlearn and how they should unlearn it. To fill this gap, we introduce a novel unlearning target, the reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which uses the reasoning-based unlearning target as guidance. We apply the target through a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn the reasoning ability needed for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, which we attribute to the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.
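
The abstract describes the training signal only in words (a cross-entropy supervised loss on the reasoning-based target, combined with a GA-based loss). The following is a minimal sketch of how such a combination might look, not the paper's actual objective: the function names, the plain negated-likelihood form of the GA term, and the `weight` parameter are all assumptions made for illustration, and the toy per-token probability lists stand in for real model outputs.

```python
import math


def token_nll(probs, target_ids):
    """Mean negative log-likelihood of the target tokens.

    probs: list of per-step probability distributions (lists of floats).
    target_ids: list of the target token index at each step.
    """
    total = sum(math.log(p[t]) for p, t in zip(probs, target_ids))
    return -total / len(target_ids)


def combined_unlearning_loss(reason_probs, reason_ids,
                             forget_probs, forget_ids, weight=1.0):
    """Hypothetical combination of a supervised term and a GA-style term.

    `weight` is an illustrative balancing factor, not a value from the paper.
    """
    # Supervised term: minimize the NLL of the reasoning-based target response.
    ce_term = token_nll(reason_probs, reason_ids)
    # GA-style term: gradient ascent on the NLL of the original answer,
    # written as minimizing the negated NLL (assumed form).
    ga_term = -token_nll(forget_probs, forget_ids)
    return ce_term + weight * ga_term


# Toy usage with a two-token vocabulary.
loss = combined_unlearning_loss(
    reason_probs=[[0.5, 0.5], [0.25, 0.75]], reason_ids=[0, 1],
    forget_probs=[[0.5, 0.5]], forget_ids=[1],
)
```

In this sketch, minimizing the loss raises the likelihood of the reasoning-based target while lowering the likelihood of the original answer. A plain negated-likelihood term is unbounded below, which is one reason practical GA variants usually bound or reweight it.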

Paper Structure (41 sections, 12 equations, 19 figures, 7 tables)

Figures (19)

  • Figure 1: The overall paradigm of TRU (our method) and supplementary details. (a) Depicts the unlearning scope of the WMDP-Bio benchmark li2024wmdp, which focuses on content implying harmful biological information. (b) Illustrates the paradigms of TRU and prior unlearning methods for direct comparison. (c) Presents evaluation results of TRU and one prior method zhang2024negative on the WMDP dataset, quantifying their performance after unlearning.
  • Figure 2: Prompt template for generation of reasoning targets using advanced reasoning models.
  • Figure 2: Average results of ablation studies on WMDP-Bio and TOFU-Forget05.
  • Figure 3: Robustness of TRU against various attacks on the WMDP-Bio dataset.
  • Figure 4: Sensitivity to the hyperparameter $\alpha$ on the TOFU benchmark.
  • ...and 14 more figures