Machine Unlearning in Large Language Models
Kongyang Chen, Zixin Wang, Bing Mi, Waixi Liu, Shaowei Wang, Xiaojun Ren, Jiaxing Shen
TL;DR
The paper tackles safety and privacy problems in large language models by introducing a machine unlearning framework that forgets harmful, hallucinatory, or outdated content with minimal fine-tuning while preserving core reasoning. It combines a data-discrimination pipeline with three complementary fine-tuning losses (negative, positive, and normal) to steer outputs away from undesired patterns and toward benign, high-quality responses. The approach is implemented with a mix of full-parameter and LoRA fine-tuning, augmented by an efficient training strategy that enables early stopping and reduces compute time. Empirical results across harmful-content, knowledge-leakage, and hallucination tasks show effective unlearning with substantial training-time savings and performance comparable or superior to traditional fine-tuning baselines, suggesting broad applicability to real-world LLM safety and compliance needs.
Abstract
Recently, large language models (LLMs) have emerged as a notable field, attracting significant attention for their ability to automatically generate intelligent content for various application domains. However, LLMs still suffer from significant security and privacy issues. For example, LLMs might expose private user information when subjected to hacking attacks or targeted prompts. To address this problem, this paper introduces a novel machine unlearning framework for LLMs. Our objective is to keep LLMs from producing harmful, hallucinatory, or privacy-compromising responses while retaining their standard output capabilities. To accomplish this, we use an evaluative model to pinpoint dialogues that need unlearning. We also establish a distance loss that serves as the model's negative loss, steering it away from its previous undesirable outputs. Furthermore, we compute the cluster mean of the expected outputs to formulate a positive loss, directing the model's outputs toward preferable outcomes without compromising its reasoning abilities and performance. Experimental results show that our approach effectively meets the unlearning objectives without substantially compromising model performance.
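The negative and positive losses described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the hinge margin on the negative term, the weights `w_neg` and `w_pos`, and all function names are assumptions chosen for illustration, and outputs are represented here as plain embedding vectors. The normal-data loss mentioned in the TL;DR is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def negative_loss(output, undesired, margin=1.0):
    """Hinge-style distance loss (assumed form): positive while the output
    is closer than `margin` to the undesired output, zero once it is
    at least `margin` away."""
    return max(0.0, margin - euclidean(output, undesired))

def positive_loss(output, preferred):
    """Distance from the output to the cluster mean of preferred outputs."""
    return euclidean(output, cluster_mean(preferred))

def unlearning_loss(output, undesired, preferred,
                    w_neg=1.0, w_pos=1.0, margin=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_neg * negative_loss(output, undesired, margin)
            + w_pos * positive_loss(output, preferred))
```

In this sketch, minimizing `unlearning_loss` pushes an output embedding at least `margin` away from the undesired output while pulling it toward the center of the preferred outputs. The margin keeps the negative term bounded, which is one plausible way to avoid driving the model arbitrarily far from its original behavior.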
