A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty

Xiaohua Feng, Yuyuan Li, Chengye Wang, Junlin Liu, Li Zhang, Chaochao Chen

TL;DR

This work tackles the challenge of unlearning in Large Language Models by introducing Memory Removal Difficulty (MRD), a neuro-inspired, sample-level metric that quantifies how hard individual data samples are to forget under small parameter perturbations. MRD connects unlearning difficulty to local curvature via the Hessian, enabling a principled analysis of why some samples resist unlearning more than others. Building on MRD, the authors propose a curriculum-style, MRD-based weighted sampling method (CGA) that prioritizes easier-to-forget samples, improving both unlearning completeness and model utility while reducing computational burden. Extensive experiments across TOFU, WMDP, WHP, and SAFE demonstrate that MRD correlates with the required updates and that MRD-guided sampling yields meaningful gains over existing baselines. Overall, MRD offers a robust, interpretable lens for evaluating and improving LLM unlearning, with practical implications for privacy compliance and safer model deployments.

Abstract

Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm's design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty ($\mathrm{MRD}$) metric to quantify sample-level unlearning difficulty. Using $\mathrm{MRD}$, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an $\mathrm{MRD}$-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
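
The weighted sampling idea in the abstract (prioritize easily forgettable samples) might be sketched as follows. This is a minimal illustration, not the authors' algorithm: the per-sample `ease_scores` array (higher meaning easier to forget), the softmax weighting, the `temperature` parameter, and both function names are assumptions introduced here. How $\mathrm{MRD}$ values map to such a score is not specified in this summary.

```python
import numpy as np

def easy_first_weights(ease_scores, temperature=1.0):
    """Turn hypothetical 'ease of forgetting' scores into sampling probabilities.

    Assumption: a higher score means the sample is easier to forget.
    The softmax form is an illustrative choice, not taken from the paper.
    """
    s = np.asarray(ease_scores, dtype=float)
    z = (s - s.max()) / temperature   # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def sample_forget_batch(ease_scores, batch_size, seed=0):
    """Draw a batch of distinct sample indices, favouring easy-to-forget samples."""
    rng = np.random.default_rng(seed)
    p = easy_first_weights(ease_scores)
    return rng.choice(len(p), size=batch_size, replace=False, p=p)

if __name__ == "__main__":
    scores = [0.1, 0.5, 0.9, 0.3]     # made-up scores for four forget-set samples
    print(easy_first_weights(scores))
    print(sample_forget_batch(scores, batch_size=2))
```

Lowering `temperature` concentrates probability on the easiest samples, while raising it moves the sampler toward uniform selection.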

Paper Structure

This paper contains 45 sections, 1 theorem, 13 equations, 5 figures, 7 tables, 2 algorithms.

Key Result

Theorem 3.2

Approximation of MRD. Assuming that $P_t(\boldsymbol{\theta})$ and $P_t(\boldsymbol{\theta}+\boldsymbol{\delta})$ are non-zero, and $\boldsymbol{\delta} \sim \mathcal{N}(0, \sigma^2 I)$ represents a small perturbation where $\sigma^2$ is sufficiently small, … where $H_t=\nabla^2 P_t(\boldsymbol{\theta})$ represents the Hessian matrix of $P_t(\boldsymbol{\theta})$ …

Figures (5)

  • Figure 1: Unlearning difficulty is measured by introducing small perturbations to model parameters (akin to minor brain injuries) and comparing the change in generation probability for a specific sample before and after perturbation. A small change indicates the sample resides in the model's long-term memory and is harder to unlearn, whereas a large change suggests easier unlearning.
  • Figure 2: Impact of sample selection on unlearning evaluation. We report the variability in performance across different LLM unlearning methods (GradDiff and NPO).
  • Figure 3: Comparison of unlearning difficulty across different sample sets in GA, GradDiff, and NPO. In these polar coordinates, samples are uniformly distributed in terms of angle, while the distance denotes the average absolute value of parameter changes.
  • Figure 4: The relationship between the MRD value and the number of unlearning updates (i.e., unlearning difficulty).
  • Figure 5: Parameter sensitivity of $\mathrm{MRD}$. (a) Effect of perturbation parameter $\boldsymbol{\delta}$, fluctuating around 0.64. (b) Effect of Monte Carlo samples $K$, with stability achieved at $K = 100$.
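
The perturbation procedure described in the Figure 1 caption, combined with the Monte Carlo sample count $K$ from Figure 5, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the softmax "model" `target_prob`, the relative-change score, and the names `perturbation_sensitivity`, `sigma`, and `num_samples` are hypothetical. The paper's exact $\mathrm{MRD}$ definition (Definition 3.1) is not reproduced in this summary, so this is a proxy and not the metric itself.

```python
import numpy as np

def target_prob(theta, x, target):
    """Toy stand-in for an LLM: softmax over linear logits, returns P(target | x)."""
    logits = theta @ x                 # theta: (vocab, dim), x: (dim,)
    logits = logits - logits.max()     # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[target]

def perturbation_sensitivity(theta, x, target, sigma=0.01, num_samples=100, seed=0):
    """Average relative change in target probability under Gaussian parameter noise.

    Assumption: the relative change |P(theta) - P(theta + delta)| / P(theta)
    is used as an illustrative score; it is not the paper's stated formula.
    """
    rng = np.random.default_rng(seed)
    base = target_prob(theta, x, target)
    changes = []
    for _ in range(num_samples):       # Monte Carlo over perturbations delta ~ N(0, sigma^2 I)
        delta = rng.normal(0.0, sigma, size=theta.shape)
        perturbed = target_prob(theta + delta, x, target)
        changes.append(abs(base - perturbed) / base)
    return float(np.mean(changes))

if __name__ == "__main__":
    theta = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])
    x = np.array([1.0, 2.0])
    print(perturbation_sensitivity(theta, x, target=1))
```

Following the Figure 1 caption, a small score would suggest a sample that is harder to unlearn, and a large score one that is easier to unlearn.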

Theorems & Definitions (5)

  • Definition 3.1
  • Theorem 3.2
  • proof
  • Definition 3.3
  • Definition 3.4