Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis

Yicheng Lang, Kehan Guo, Yue Huang, Yujun Zhou, Haomin Zhuang, Tianyu Yang, Yao Su, Xiangliang Zhang

TL;DR

The paper tackles the inadequacy of single-value metrics for assessing LLM unlearning by introducing UNCD, a Cognitive Diagnosis Modeling-based framework that enables fine-grained forgetting evaluation. It provides UNCD-Cyber, a cyberattack-focused benchmark, and UNCD-Agent, a data-generation approach for targeted forgetting. Across two base models and eight unlearning methods, UNCD exposes residual harmful knowledge that QA metrics overlook, and demonstrates improved forgetting when CDM-guided diagnostics inform data generation. This granular evaluation approach offers a practical pathway to safer, more effective unlearning in LLMs via iterative refinement.

Abstract

Due to the widespread use of LLMs and the rising critical ethical and safety concerns, LLM unlearning methods have been developed to remove harmful knowledge and undesirable capabilities. In this context, evaluations are mostly based on single-value metrics such as QA accuracy. However, these metrics often fail to capture the nuanced retention of harmful knowledge components, making it difficult to assess the true effectiveness of unlearning. To address this issue, we propose UNCD (UNlearning evaluation via Cognitive Diagnosis), a novel framework that leverages Cognitive Diagnosis Modeling for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities. Moreover, we introduce UNCD-Agent, which refines unlearning by diagnosing knowledge remnants and generating targeted unlearning data. Extensive experiments across eight unlearning methods and two base models demonstrate that UNCD not only enhances evaluation but also effectively facilitates the removal of harmful LLM abilities.
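As a rough illustration of why a single aggregate score can mislead, the sketch below compares overall QA accuracy with a per-concept breakdown on toy data. It is not the paper's Cognitive Diagnosis Model, which estimates latent knowledge states; it is only a simple tally, and the concept names, the response records, and both helper functions are hypothetical.

```python
from collections import defaultdict


def aggregate_accuracy(responses):
    """Single-value metric: fraction of questions answered correctly."""
    return sum(r["correct"] for r in responses) / len(responses)


def per_concept_accuracy(responses):
    """Accuracy broken down by the knowledge concepts tagged on each question."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in responses:
        for concept in r["concepts"]:
            totals[concept] += 1
            hits[concept] += int(r["correct"])
    return {concept: hits[concept] / totals[concept] for concept in totals}


# Hypothetical responses from an unlearned model (illustrative only).
responses = [
    {"concepts": ["phishing"], "correct": False},
    {"concepts": ["phishing"], "correct": False},
    {"concepts": ["phishing"], "correct": False},
    {"concepts": ["privilege_escalation"], "correct": True},
    {"concepts": ["privilege_escalation"], "correct": True},
    {"concepts": ["sql_injection"], "correct": False},
    {"concepts": ["sql_injection"], "correct": False},
    {"concepts": ["sql_injection"], "correct": False},
]

overall = aggregate_accuracy(responses)
by_concept = per_concept_accuracy(responses)

print(f"overall accuracy: {overall:.2f}")
for concept, acc in sorted(by_concept.items()):
    print(f"{concept}: {acc:.2f}")
```

In this toy case the overall accuracy looks low, yet one concept is still answered perfectly, which is the kind of residual knowledge a per-concept view is meant to surface.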

Paper Structure

This paper contains 21 sections, 2 equations, 22 figures, 4 tables, 1 algorithm.

Figures (22)

  • Figure 1: Comparison of single-value (QA accuracy) and UNCD evaluation for LLM ability unlearning. Two unlearning methods, GA (Thudi et al., 2022) and NPO (Zhang et al., 2024), do reduce QA accuracy, but UNCD reveals knowledge concepts that persist in the unlearned models, highlighting the limitations of relying on a single aggregate metric.
  • Figure 2: Overview of UNCD. (Top) The data construction pipeline and dataset examples. (Bottom) The evaluation process. LLMs are evaluated before and after unlearning using precise or training-free diagnosis, which reveals their knowledge states.
  • Figure 3: Variations of knowledge states $F_s$ at four unlearning steps as Llama-3-8B undergoes GA GDR, NPO GDR, GA KLR and NPO KLR.
  • Figure 4: Forget and retain knowledge states of Llama-3-8B and Mistral-7B under unlearning. Forget knowledge states are diagnosed by the NCDM model, while retain knowledge states are measured by average accuracy (Acc) on the UNCD-Cyber Evaluation Dataset.
  • Figure 5: Few-shot diagnosis results of Llama-3-8B unlearned with NPO and NPO GDR.
  • ...and 17 more figures