FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, Xiaowei Huang
TL;DR
FALCON is a fine-grained unlearning framework for removing specific knowledge from large language models safely. It combines three components: information-theoretic guidance to locate the layers where the knowledge to forget and the knowledge to retain are least entangled, contrastive representation unalignment using Principal Offset Vectors derived from SVD, and gradient projection that orthogonalizes the forgetting and retention updates. A second-order optimizer is used for stability. Across harmful-knowledge, copyrighted-content, and entity unlearning tasks, FALCON achieves superior unlearning effectiveness while preserving general capabilities and resisting knowledge recovery. Although the evaluation is limited to text-based LLMs and relatively small models, the approach offers a scalable path toward precise, interpretable, and robust unlearning in real-world settings with regulatory and safety implications.
Abstract
Large language models have been widely applied, but they can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches rely on coarse-grained loss combinations and struggle both to separate knowledge precisely and to balance removal effectiveness against model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflicting gradients onto orthogonal subspaces to resolve conflicts between the forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility and exhibits robust resistance against knowledge recovery attempts.
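To make the gradient-projection idea in the abstract concrete, the following is a minimal sketch of one common way to resolve a conflict between two update directions. When the forgetting gradient has a negative inner product with the retention gradient, the component along the retention gradient is removed. This is an illustrative assumption and not necessarily FALCON's exact projection rule. The function name `project_conflicting`, the `eps` parameter, and the toy vectors are hypothetical.

```python
import numpy as np


def project_conflicting(g_forget, g_retain, eps=1e-12):
    """Return g_forget with any component that opposes g_retain removed.

    Hypothetical helper: a generic conflict-aware projection, assumed here
    for illustration rather than taken from the paper.
    """
    dot = float(np.dot(g_forget, g_retain))
    if dot >= 0.0:
        # No conflict: the two objectives do not pull against each other.
        return g_forget.copy()
    # Conflict: subtract the projection of g_forget onto g_retain so the
    # result is orthogonal to the retention gradient.
    scale = dot / (float(np.dot(g_retain, g_retain)) + eps)
    return g_forget - scale * g_retain


# Toy example: the forgetting gradient opposes the retention gradient
# along the second coordinate.
g_forget = np.array([1.0, -2.0])
g_retain = np.array([0.0, 1.0])
g_proj = project_conflicting(g_forget, g_retain)
```

In this toy case the projected forgetting gradient keeps its first coordinate and loses the part that pointed against the retention direction, so a step along it no longer degrades the retention objective to first order.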
