WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models
Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, Sijia Liu
TL;DR
WAGLE introduces a principled weight-attribution framework for large language model unlearning. It links the influence of individual weights to the forgetting and retention objectives through a bi-level optimization formulation and implicit-gradient analysis. The resulting closed-form attribution score separates pruning-like sensitivity from utility-retention effects, enabling modular updates restricted to influential weights. The approach is agnostic to the underlying unlearning method and improves GradDiff, NPO, and PO on four benchmarks (TOFU, WMDP, WHP, DETOX) and multiple models, while revealing insights into weight distribution and layer-level sensitivity. Empirical results show improved forgetting with manageable utility trade-offs, efficient offline computation, and a model-footprint perspective on which components matter most for unlearning. The work provides a foundation for safer, regulation-compliant LLMs while acknowledging limitations in hyperparameter selection and in determining sparsity automatically, both of which suggest avenues for future work.
Abstract
The need for effective unlearning mechanisms in large language models (LLMs) is increasingly urgent, driven by the necessity to adhere to data regulations and foster ethical generative AI practices. Despite growing interest in LLM unlearning, much of the existing research has focused on varied unlearning method designs to boost effectiveness and efficiency. However, the inherent relationship between model weights and LLM unlearning has not been extensively examined. In this paper, we systematically explore how model weights interact with unlearning processes in LLMs and we design the weight attribution-guided LLM unlearning method, WAGLE, which unveils the interconnections between the 'influence' of weights and the 'influence' of data to forget and retain in LLM generation. By strategically guiding LLM unlearning across different types of unlearning methods and tasks, WAGLE can erase the undesired content while maintaining performance on the original tasks. Our extensive experiments show that WAGLE boosts unlearning performance across a range of LLM unlearning methods such as gradient difference and (negative) preference optimization, applications such as fictitious unlearning, malicious use prevention, and copyrighted information removal, and models including Zephyr-7b-beta and Llama2-7b. To the best of our knowledge, our work offers the first principled method for attributing and pinpointing the influential weights in enhancing LLM unlearning. It stands in contrast to previous methods that either lack weight attribution or rely on simpler weight attribution techniques.
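The idea of pinpointing influential weights and restricting unlearning updates to them can be illustrated with a minimal sketch. The code below is not the WAGLE attribution score: the `scores` values are placeholders standing in for whatever attribution method is used, and the function names `top_k_mask` and `masked_step`, the `keep_ratio` parameter, and all toy numbers are assumptions made for illustration only.

```python
from typing import List


def top_k_mask(scores: List[float], keep_ratio: float) -> List[int]:
    """Return a 0/1 mask that selects the highest-scoring weights (hypothetical helper)."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    k = int(keep_ratio * len(scores))
    # Rank indices by descending score; ties are broken by index order.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    selected = set(ranked[:k])
    return [1 if i in selected else 0 for i in range(len(scores))]


def masked_step(weights: List[float], grads: List[float],
                mask: List[int], lr: float) -> List[float]:
    """Apply one gradient step where mask == 1; all other weights stay frozen."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Toy values, assumed for illustration (not taken from the paper).
weights = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]    # stand-in for the gradient of an unlearning loss
scores = [0.1, 0.9, 0.3, 0.7]   # stand-in attribution scores, one per weight
mask = top_k_mask(scores, keep_ratio=0.5)
updated = masked_step(weights, grads, mask, lr=1.0)
```

Because the mask depends only on the scores, it can be computed once before unlearning and reused with any gradient-based unlearning objective, which is the sense in which such an update is modular.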
