Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning
Hong kyu Lee, Ruixuan Liu, Li Xiong
TL;DR
This work presents Direct Token Optimization (DTO), a self-contained unlearning framework for large language models that needs no external resources. DTO identifies target tokens via a delta-score, performs gradient-based unlearning on those tokens, and preserves behavior on non-target tokens through KL-regularization. It achieves substantial forget-quality gains (up to roughly 0.92 on some tasks) with minimal utility degradation, and is evaluated on the TOFU and MUSE benchmarks against strong baselines such as DPO, NPO, FLAT, and LLMU. By reducing reliance on retain data, auxiliary models, and external services, the approach makes data removal in LLMs more practical, with promising implications for privacy and content-control applications. The paper also discusses ablations, gradient-orthogonalization techniques, and token-selection strategies, all of which keep the unlearning mechanism self-contained.
Abstract
Machine unlearning is an emerging technique that removes the influence of a subset of training data (the forget set) from a model without full retraining, with applications including privacy protection, content moderation, and model correction. The key challenge lies in ensuring that the model completely forgets the knowledge in the forget set without compromising its overall utility. Existing unlearning methods for large language models (LLMs) often rely on auxiliary language models, retain datasets, or even commercial AI services to unlearn effectively and maintain model utility. However, dependence on these external resources is often impractical and can introduce additional privacy risks. In this work, we propose direct token optimization (DTO), a novel self-contained unlearning approach for LLMs that directly optimizes token-level objectives and eliminates the need for external resources. Given a sequence to unlearn, we identify two categories of tokens: target tokens, which capture the knowledge critical for unlearning, and the remaining non-target tokens, which are crucial for maintaining model utility. The former are used to optimize the unlearning objective, while the latter serve to preserve the model's performance. Experimental results show that DTO achieves up to a 16.8$\times$ improvement in forget quality over the latest baselines on several benchmark datasets while maintaining a comparable level of model utility.
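The split between target and non-target tokens can be illustrated with a small sketch. This is not the paper's implementation: the function names, the `kl_weight` parameter, the use of a frozen copy of the original model as the KL reference, the direction of the KL term, and the specific form of the forget term (minimizing the log-likelihood of target tokens) are all assumptions made for illustration. The target mask is taken as an input, since the delta-score that selects target tokens is not defined in this summary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logp, cur_logp):
    """KL(ref || cur) for one position, given log-probabilities."""
    return sum(math.exp(r) * (r - c) for r, c in zip(ref_logp, cur_logp))

def token_level_unlearning_loss(cur_logits, ref_logits, token_ids,
                                target_mask, kl_weight=1.0):
    """Hypothetical token-level objective (illustrative assumption).

    cur_logits / ref_logits: per-position logits from the model being
    updated and from an assumed frozen reference copy.
    target_mask: True where a token was selected as a target token.
    """
    forget_terms, retain_terms = [], []
    for cur, ref, tok, is_target in zip(cur_logits, ref_logits,
                                        token_ids, target_mask):
        cur_logp = log_softmax(cur)
        if is_target:
            # Minimizing log p(token) pushes the model away from the target.
            forget_terms.append(cur_logp[tok])
        else:
            # KL term keeps non-target predictions close to the reference.
            retain_terms.append(kl_divergence(log_softmax(ref), cur_logp))
    forget = sum(forget_terms) / len(forget_terms) if forget_terms else 0.0
    retain = sum(retain_terms) / len(retain_terms) if retain_terms else 0.0
    return forget + kl_weight * retain
```

In this sketch only the forget sequence itself and a copy of the model are needed, which mirrors the self-contained setting described above. When the current and reference logits agree on every non-target position, the KL term is zero and the loss reduces to the mean log-likelihood of the target tokens.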
