Ethos: Rectifying Language Models in Orthogonal Parameter Space
Lei Gao, Yue Niu, Tingting Tang, Salman Avestimehr, Murali Annavaram
TL;DR
This work introduces Ethos, a method that rectifies language models by editing them in an orthogonal parameter space identified via singular value decomposition (SVD). Ethos projects a task vector onto the principal components of the pre-trained model, retains only the components associated with undesired knowledge, and applies the resulting filtered edit Δθ̃_task with scaling factor λ to remove toxicity, bias, or memorized data while preserving general utility. The approach uses parameter-efficient fine-tuning (PEFT) with LoRA and a downstream auxiliary dataset to better align the orthogonal space, and it achieves better unlearning performance than baselines across OPT, GPT-2, GPT-Neo, and Llama2 models on toxicity, debiasing, and memorization tasks. The results suggest that Ethos is a scalable, cost-effective alternative to retraining for safer deployment of large language models, and ablations highlight the role of auxiliary data and thresholding in controlling the balance between unlearning and utility.
Abstract
Language models (LMs) have greatly propelled research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and to avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes generally beneficial knowledge from undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained model using singular value decomposition. Then, by projecting the task vector onto these principal components, Ethos identifies which components encode general knowledge and which encode undesired knowledge. Ethos then performs negation using only the part of the task vector that carries undesired knowledge, thereby minimizing collateral damage to general model utility. We demonstrate the efficacy of our approach on three different tasks: debiasing, detoxification, and memorization unlearning. Evaluations show that Ethos is more effective at removing undesired knowledge and maintaining overall model performance than current task arithmetic methods.
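The project-filter-negate procedure described above can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The function names, the relative threshold `rel_threshold`, and the rule that components with small projection magnitude are the ones treated as undesired are all assumptions made for illustration; the abstract does not state how components are classified.

```python
import numpy as np


def filtered_task_vector(w_pre, delta_task, rel_threshold=0.05):
    """Project a task vector onto the pre-trained principal components and
    keep only the components selected as 'undesired' (hypothetical helper)."""
    # Principal components of the pre-trained weight matrix via SVD.
    u, _, vt = np.linalg.svd(w_pre, full_matrices=True)

    # Coordinates of the task vector in the pre-trained basis.
    s_task = u.T @ delta_task @ vt.T

    # ASSUMPTION: entries with small magnitude relative to the largest entry
    # are treated as undesired knowledge; all other entries are zeroed out.
    cutoff = rel_threshold * np.abs(s_task).max()
    s_filtered = np.where(np.abs(s_task) < cutoff, s_task, 0.0)

    # Map the filtered coordinates back to parameter space.
    return u @ s_filtered @ vt


def rectify(w_pre, delta_task, lam=1.0, rel_threshold=0.05):
    """Negate the filtered task vector, scaled by lam (hypothetical helper)."""
    return w_pre - lam * filtered_task_vector(w_pre, delta_task, rel_threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_pre = rng.normal(size=(6, 4))        # stand-in for a pre-trained weight
    delta_task = rng.normal(size=(6, 4))   # stand-in for a task vector
    w_new = rectify(w_pre, delta_task, lam=0.5, rel_threshold=0.3)
    print(w_new.shape)
```

In this sketch, `rel_threshold=0` retains nothing and leaves the weights unchanged, while a threshold above 1 retains the full task vector, so the threshold and `lam` together control how much of the task vector is negated.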
