UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models

Yijiang River Dong; Hongzhou Lin; Mikhail Belkin; Ramon Huerta; Ivan Vulić

TL;DR

UnDIAL addresses privacy-related unlearning in LLMs by replacing loss maximization with self-distillation on adjusted logits, which down-weights the tokens to be forgotten. It introduces a fixed target distribution and a focused variant (FUnDIAL) that emphasizes key tokens such as entities and nouns, improving the trade-off between forgetting and language usefulness. Experiments on the Extraction Data and MUSE benchmarks show UnDIAL achieving robust, scalable unlearning that outperforms GA, NPO, and auxiliary-model baselines, with stable training dynamics across hyperparameters. The work offers a practical approach to privacy-preserving unlearning for real-world LLM deployment, and notes limitations and avenues for extending it to larger models and automatic sensitive-token detection.

Abstract

Mitigating the retention of sensitive or private information in large language models is essential for enhancing privacy and safety. Existing unlearning methods, like Gradient Ascent and Negative Preference Optimization, directly tune models to remove unwanted information. However, these methods often become unstable because they fine-tune by maximizing cross-entropy loss, which is the opposite of traditional loss minimization in learning. This reversal creates instability, especially on larger datasets, as the model struggles to balance unlearning with maintaining language capacity, leading to over-unlearning. In this paper, we introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens. This technique ensures smooth convergence and avoids catastrophic forgetting, even in challenging unlearning tasks with large datasets and sequential unlearning requests. Extensive experiments show that UnDIAL can achieve both robustness in unlearning and scalability while maintaining stable training dynamics and resilience to hyperparameter tuning.
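The adjusted-logit target described in the abstract and in the Figure 1 caption can be illustrated for a single token position. This is a minimal sketch, not the authors' implementation: it assumes the adjustment subtracts a strength `gamma` at the index of the token to be forgotten (the caption only says the one-hot distribution is subtracted, and the Figure 4 caption mentions an unlearning strength $\gamma$), and that the student is trained with cross-entropy against the softmax of the adjusted logits. The names `adjusted_target` and `distillation_loss` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adjusted_target(teacher_logits, target_id, gamma):
    """Assumed adjustment: subtract gamma from the logit of the token to forget,
    then renormalize. The teacher is the original (frozen) model itself."""
    adjusted = list(teacher_logits)
    adjusted[target_id] -= gamma
    return softmax(adjusted)

def distillation_loss(student_logits, target_probs):
    """Cross-entropy between the fixed target distribution and the student."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in student_logits))
    return -sum(p * (z - log_z) for p, z in zip(target_probs, student_logits))

# Toy vocabulary of four tokens; token 0 is the memorized token to forget.
teacher_logits = [2.0, 0.5, -1.0, 0.0]
target_id = 0
gamma = 3.0

original = softmax(teacher_logits)
target = adjusted_target(teacher_logits, target_id, gamma)

# Loss before any update (student still equals the original model) ...
loss_start = distillation_loss(teacher_logits, target)
# ... and the loss a student would reach if it matched the target exactly.
matched_logits = [z - gamma if i == target_id else z
                  for i, z in enumerate(teacher_logits)]
loss_matched = distillation_loss(matched_logits, target)
```

Because the target is a fixed, properly normalized distribution, the loss is bounded below by the target's entropy and is minimized rather than maximized, which is the contrast the abstract draws with Gradient Ascent-style objectives.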

Paper Structure (27 sections, 5 equations, 8 figures, 3 tables)

Introduction
Background and Related Work
Memorization in Large Language Models
Unlearning in Large Language Models
Methodology
UnDIAL: Method Description
Variant: Focused UnDIAL (FUnDIAL)
Case Study One: Extraction Data
Dataset and Model
Unlearning Metrics
'Model Usefulness' Metrics
Results and Discussion
Case Study Two: MUSE Benchmark
Conclusion
Appendix
...and 12 more sections

Figures (8)

Figure 1: An illustration of the self-distillation process in the proposed UnDIAL method: The original logits generated by the model are adjusted by subtracting the one-hot distribution of the target token. The student model is then fine-tuned to approximate this modified logit distribution. Since the adjustments rely solely on the original model’s outputs, this is a self-distillation process to de-emphasize the token to be forgotten.
Figure 2: Training dynamics of Direct Tuning methods on the MUSE benchmark (Shi et al., 2024). MUSE divides data into two sets: the Forget set, containing the information to be unlearned, and the Retain set, which measures the impact of unlearning on unrelated knowledge. Ideally, unlearning should be precise, affecting only the Forget set without disturbing the Retain set. MUSE provides fine-tuned models for both sets as optimal reference points. To capture the training dynamics, we compute the average KL divergence between the unlearned model and the MUSE reference models over the Forget and Retain sets. An effective unlearning model should closely match both references, with near-zero divergence indicating successful unlearning and model performance preservation.
Figure 3: UnDIAL versus baselines when performing unlearning on the GPT-Neo 125M model. A method with a lower EL score and a higher MAUVE score is considered better, i.e., towards the upper-right corner. For each method, we vary the unlearning strength, which naturally produces a Pareto-style curve showing the trade-off between memorization accuracy (EL) and language capacity (MAUVE).
Figure 4: Effectiveness of the Focused UnDIAL variant versus the basic variant. The left figure shows the EL vs. MAUVE trade-off after introducing entity and noun indicators in our method; see the section "Variant: Focused UnDIAL (FUnDIAL)". By focusing on these specific tokens, we show that the performance can be further improved. The two figures on the right show the stable training dynamics for a given unlearning strength $\gamma = 30$ across different variants.
Figure 5: Results on MUSE-News with LLaMA-2 7B. Knowledge Memorization and Utility Preservation refer to the Q&A accuracy on the BBC News articles that are to be forgotten and retained, respectively. The results of the baseline models are taken directly from Shi et al. (2024). KLR and GDR refer to adding a KL divergence regularization term or a gradient descent learning objective on the retain set, respectively.
...and 3 more figures
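The divergence measure in the Figure 2 caption, an average KL divergence between the unlearned model and a reference model, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's evaluation code: the caption does not give the direction of the divergence or the averaging granularity, so the example assumes KL(reference || unlearned) averaged over token positions, and the helper names `kl_divergence` and `average_kl` are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def kl_divergence(ref_logits, model_logits):
    """KL(reference || model) for one next-token distribution (assumed direction)."""
    ref_logp = log_softmax(ref_logits)
    model_logp = log_softmax(model_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(ref_logp, model_logp))

def average_kl(ref_positions, model_positions):
    """Average the per-position KL over all token positions in a set."""
    pairs = list(zip(ref_positions, model_positions))
    return sum(kl_divergence(r, m) for r, m in pairs) / len(pairs)

# Toy logits for two token positions over a three-token vocabulary.
reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
unlearned_close = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]  # matches the reference
unlearned_far = [[-1.0, 0.5, 2.0], [3.0, 0.2, 0.3]]    # has drifted away

kl_close = average_kl(reference, unlearned_close)
kl_far = average_kl(reference, unlearned_far)
```

Computed once over the Forget set and once over the Retain set, a value near zero in both cases corresponds to the outcome the caption describes as ideal.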
