Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu

TL;DR

This paper identifies a safety gap in unlearning for large reasoning models (LRMs): erasing only final answers leaves sensitive information embedded in intermediate reasoning traces. It introduces Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), which suppresses reasoning traces linked to forget data while preserving reasoning ability through CoT supervision drawn from a high-quality CoT corpus. The approach extends RMU by targeting CoT representations and adding CoT supervision to maintain reasoning performance. It improves reasoning trace unlearning accuracy (RT-UA) and safety on the WMDP and STAR-1 benchmarks, with acceptable trade-offs in utility. The work offers a practical path toward safer LRMs in high-stakes applications, while acknowledging limitations in hyperparameter tuning, theoretical guarantees, and robustness to adversarial settings.

Abstract

Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
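
To make the representation-misdirection idea concrete, here is a minimal sketch of an RMU-style objective on toy activation vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names (`random_unit_vector`, `mean_sq_dist`, `rmu_style_loss`), the scaling constant `c`, and the weight `alpha` are hypothetical choices made here. The sketch only assumes that forget-side activations (which, in a reasoning-aware variant, could include activations taken over CoT tokens) are pushed toward a scaled random direction while retain-side activations are kept close to those of a frozen reference model. The CoT supervision term that the abstract credits with preserving reasoning ability is not modeled.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Fixed random direction used as the misdirection target (illustrative)."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_sq_dist(acts, targets):
    """Mean (over vectors) of the squared L2 distance between paired vectors."""
    total = 0.0
    for a, t in zip(acts, targets):
        total += sum((ai - ti) ** 2 for ai, ti in zip(a, t))
    return total / len(acts)


def rmu_style_loss(forget_acts, retain_acts, frozen_retain_acts, u, c=5.0, alpha=1.0):
    """Forget term plus alpha-weighted retain term (hypothetical toy version).

    forget_acts:        activations of the model being updated on forget inputs
                        (assumption: may also cover CoT-token activations)
    retain_acts:        activations of the model being updated on retain inputs
    frozen_retain_acts: activations of the frozen reference model on the same
                        retain inputs
    """
    target = [c * x for x in u]
    # Push forget-side activations toward the scaled random direction c * u.
    forget_term = mean_sq_dist(forget_acts, [target] * len(forget_acts))
    # Keep retain-side activations close to the frozen model's activations.
    retain_term = mean_sq_dist(retain_acts, frozen_retain_acts)
    return forget_term + alpha * retain_term


if __name__ == "__main__":
    u = random_unit_vector(4)
    loss = rmu_style_loss(
        forget_acts=[[0.0, 0.0, 0.0, 0.0]],
        retain_acts=[[1.0, 2.0, 3.0, 4.0]],
        frozen_retain_acts=[[1.0, 2.0, 3.0, 4.0]],
        u=u,
    )
    print(f"toy loss: {loss:.4f}")
```

In this toy setting the retain term vanishes whenever the updated and frozen activations coincide, so the loss is driven entirely by how far the forget-side activations sit from the misdirection target.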

Paper Structure

This paper contains 24 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Demonstration of LRM unlearning challenges. (a) Final answer unlearning effectiveness, measured by accuracy on the WMDP evaluation set, for both RMU-unlearned LLM (Qwen2.5-14B) and unlearned LRM (DeepSeek-R1-Distill-Qwen-14B), compared to their pre-unlearned counterparts. (b) Generation examples from the unlearned LLM and LRM on WMDP, highlighting differences in final answer unlearning and residual sensitive content in reasoning traces. (c) Reasoning ability degradation, measured by accuracy of the original and RMU/NPO-unlearned LRM (DeepSeek-R1-Distill-Qwen-14B) on AIME 2024, MATH-500, and GPQA Diamond benchmarks.
  • Figure 2: Distribution of reasoning traces into unthinking categories (C1–C4) on the WMDP benchmark after applying RMU for LRM (R1-Distill-LLaMA-8B) unlearning. Categories C2–C4 indicate varying levels of sensitive information leakage, while only C1 is considered successful unthinking. 19.7% of evaluation samples fall into C2–C4, indicating unsafe forgetting.
  • Figure 3: Category-wise distribution of RMU, RMU w/ ZT, and RMU w/ RTP on WMDP using LRM (R1-Distill-LLaMA-8B), evaluated by GPT-o3-mini. Cases are grouped into C1–C4 by sensitivity leakage, where C1 indicates successful unthinking and C2–C4 reflect varying failure levels.
  • Figure A1: Reasoning trace unlearning accuracy (RT-UA) comparison between RMU and R2MU on the WMDP dataset, using DeepSeek-R1-Distill-Qwen-14B across all judge models and prompts. RT-UA results remain highly consistent across different judge models (o3-mini, o1, o4-mini) and prompt configurations (4-Class and 2-Class), validating the robustness of the LLM-as-judge protocol.
  • Figure A2: Reasoning trace leakage evaluation (TraceLeak@K) comparison between RMU and R2MU on the WMDP dataset, across DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B with the 4-Class LLM-as-judge.
  • ...and 1 more figure
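
The C1 to C4 categorization in the Figure 2 and Figure 3 captions can be illustrated with a small tally over judge labels. This is a hedged sketch: the helper names (`category_distribution`, `unsafe_fraction`) and the label strings are assumptions introduced here, and the paper's RT-UA metric may be defined differently. The only thing taken from the captions is that C1 counts as successful unthinking and C2 to C4 count as leakage.

```python
from collections import Counter

ALL_CATEGORIES = ("C1", "C2", "C3", "C4")
SAFE_CATEGORY = "C1"  # per the captions, only C1 is successful unthinking


def category_distribution(labels):
    """Return the fraction of judged traces that fall into each category."""
    if not labels:
        raise ValueError("labels must be non-empty")
    unknown = set(labels) - set(ALL_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(labels)
    n = len(labels)
    return {cat: counts[cat] / n for cat in ALL_CATEGORIES}


def unsafe_fraction(labels):
    """Fraction of traces judged to leak sensitive content (C2, C3, or C4)."""
    category_distribution(labels)  # reuse the validation above
    leaked = sum(1 for label in labels if label != SAFE_CATEGORY)
    return leaked / len(labels)


if __name__ == "__main__":
    # Hypothetical judge outputs for ten reasoning traces.
    toy_labels = ["C1"] * 8 + ["C2", "C4"]
    print(category_distribution(toy_labels))
    print(unsafe_fraction(toy_labels))
```

Applied to real judge outputs, `unsafe_fraction` would correspond to the share of evaluation samples falling into C2 to C4, the quantity the Figure 2 caption reports as 19.7% for RMU.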