DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning

Yaxuan Wang; Chris Yuhao Liu; Quan Liu; Jinglong Pang; Wei Wei; Yujia Bao; Yang Liu

TL;DR

DRAGON tackles practical LLM unlearning with a training-free, in-context framework: a detection module flags prompts that target forgotten content, and a CoT-based guard steers inference, with no fine-tuning and no access to retain data. It formalizes sample and concept unlearning for black-box models and proposes new metrics (Refusal Quality, Dynamic Deviation Score, Dynamic Utility Score) to assess forgetting efficacy and utility over time. On hazardous-knowledge (WMDP) and private-data (TOFU) tasks, DRAGON consistently outperforms strong baselines, achieving high refusal quality, near-random performance on targeted content, and preserved language utility. The approach scales across model sizes and architectures, remains robust under continual unlearning, and is supported by ablation analyses. Its practical impact is safer deployment of LLMs in privacy- and safety-critical domains; noted limitations include added inference latency and reliance on controlled access to unlearn stores.

Abstract

Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that uses in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These prompts are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning, we introduce novel metrics for unlearning performance and for the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
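The detect-then-guard flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the detector is a plain substring match standing in for the paper's detection module (whose design is not specified in this section), and the base and guard models are stubs rather than real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class GuardedLLM:
    """Hypothetical wrapper: detect forget-worthy prompts, then guard in context."""
    base_model: Callable[[str], str]   # black-box LLM, never modified
    guard_model: Callable[[str], str]  # produces CoT guard instructions
    forget_terms: Tuple[str, ...]      # assumed stand-in for the detection module

    def is_forget_prompt(self, prompt: str) -> bool:
        # Assumption: simple substring matching; the real detector may differ.
        lowered = prompt.lower()
        return any(term.lower() in lowered for term in self.forget_terms)

    def generate(self, prompt: str) -> str:
        if self.is_forget_prompt(prompt):
            # Prepend the guard's CoT instruction so the base model is steered
            # in context instead of being fine-tuned.
            instruction = self.guard_model(prompt)
            return self.base_model(f"{instruction}\n\nUser question: {prompt}")
        # Prompts that are not flagged pass through unchanged.
        return self.base_model(prompt)


def stub_guard(prompt: str) -> str:
    # Stub guard: a fixed instruction in place of generated CoT reasoning.
    return "[GUARD] Reason step by step; if this touches forgotten content, decline."


def stub_base(prompt: str) -> str:
    # Stub base model: refuses only when a guard instruction is present.
    return "I can't help with that." if prompt.startswith("[GUARD]") else "normal answer"


llm = GuardedLLM(stub_base, stub_guard, forget_terms=("jane doe",))
flagged = llm.generate("What is Jane Doe's home address?")
passed = llm.generate("What is the capital of France?")
```

In this sketch the base model's weights are never touched; only the prompt it receives changes, which is what makes the approach applicable to black-box models.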
