Ethos: Rectifying Language Models in Orthogonal Parameter Space

Lei Gao; Yue Niu; Tingting Tang; Salman Avestimehr; Murali Annavaram

TL;DR

This work introduces Ethos, a method that rectifies language models by editing them in an orthogonal parameter space identified via Singular Value Decomposition (SVD). Ethos projects the task vector for a downstream task onto the principal components and retains only the components associated with undesired knowledge. The resulting filtered edit, Δθ̃_task, is applied with a scaling factor λ to remove toxicity, bias, or memorized data while preserving general utility. The approach uses parameter-efficient fine-tuning (PEFT) with LoRA, together with an auxiliary dataset for the downstream task that better aligns the orthogonal space. Across OPT, GPT-2, GPT-Neo, and Llama2 models, Ethos outperforms the baselines on toxicity, debiasing, and memorization unlearning tasks. The results suggest that Ethos is a scalable, cost-effective alternative to retraining for safer deployment of large language models. Ablations highlight the role of the auxiliary data and of the filtering threshold in controlling the balance between unlearning and utility.
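The pipeline summarized above can be written out as a short chain of equations. This is a sketch under stated assumptions rather than the paper's exact formulation: the symbols $\bm{W}$ (a pre-trained weight matrix), $\bm{U}$, $\bm{\Sigma}$, $\bm{V}$ (its SVD factors), $\bm{S}_{\text{task}}$ (the projected task vector), and the threshold $\xi$ are introduced here for illustration, and the magnitude-based filtering rule is an assumed reading of "thresholding".

```latex
\begin{aligned}
\bm{W} &= \bm{U}\,\bm{\Sigma}\,\bm{V}^{\top}
  && \text{(SVD of the pre-trained weights)} \\
\bm{S}_{\text{task}} &= \bm{U}^{\top}\,\Delta\bm{\theta}_{\text{task}}\,\bm{V}
  && \text{(project the task vector)} \\
\tilde{\bm{S}}_{\text{task}}[i,j] &=
  \begin{cases}
    \bm{S}_{\text{task}}[i,j], & \lvert \bm{S}_{\text{task}}[i,j] \rvert \ge \xi \\
    0, & \text{otherwise}
  \end{cases}
  && \text{(keep task-specific components; assumed rule)} \\
\Delta\tilde{\bm{\theta}}_{\text{task}} &= \bm{U}\,\tilde{\bm{S}}_{\text{task}}\,\bm{V}^{\top}
  && \text{(map back to parameter space)} \\
\bm{\theta}^{*} &= \bm{\theta} - \lambda\,\Delta\tilde{\bm{\theta}}_{\text{task}}
  && \text{(negate with scaling } \lambda\text{)}
\end{aligned}
```

Because $\bm{U}$ and $\bm{V}$ are orthogonal, setting $\xi = 0$ recovers the unfiltered task vector, so the threshold is the only point where this sketch departs from plain task-vector negation.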

Abstract

Language models (LMs) have greatly propelled research on natural language processing. However, LMs also raise concerns about the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in their outputs and to avoid privacy leakage. Ethos is built on task arithmetic. Unlike current task arithmetic algorithms, however, Ethos distinguishes generally beneficial knowledge from undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained model using singular value decomposition. Then, by projecting the task vector onto these principal components, Ethos identifies which components encode general knowledge and which encode undesired knowledge. Ethos then negates only the task vector that carries undesired knowledge, thereby minimizing collateral damage to general model utility. We demonstrate the efficacy of our approach on three different tasks: debiasing, detoxification, and memorization unlearning. Evaluations show that Ethos is more effective at removing undesired knowledge while maintaining overall model performance than current task arithmetic methods.
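The steps described in the abstract can be illustrated for a single weight matrix with a minimal NumPy sketch. It is not the authors' implementation: the function names `filtered_task_vector` and `rectify`, the `threshold` parameter, and the rule of keeping projected entries by absolute magnitude are all assumptions made for illustration, and the sketch omits the LoRA fine-tuning and auxiliary dataset mentioned in the TL;DR.

```python
import numpy as np


def filtered_task_vector(w_pre, delta, threshold):
    """Project a task vector onto the SVD basis of w_pre and filter it.

    w_pre:     pre-trained weight matrix, shape (m, n)
    delta:     task vector (fine-tuned minus pre-trained weights), shape (m, n)
    threshold: hypothetical cutoff; projected entries with a smaller
               absolute value are treated as general knowledge and zeroed
    """
    # Full SVD gives orthogonal bases U (m x m) and V (n x n).
    u, _, vt = np.linalg.svd(w_pre, full_matrices=True)
    # Coordinates of the task vector in the orthogonal space.
    s_task = u.T @ delta @ vt.T
    # Assumed filtering rule: keep only large-magnitude components.
    s_filtered = np.where(np.abs(s_task) >= threshold, s_task, 0.0)
    # Map the filtered coordinates back to parameter space.
    return u @ s_filtered @ vt


def rectify(w_pre, delta, lam, threshold):
    """Negate the filtered task vector, scaled by lam."""
    return w_pre - lam * filtered_task_vector(w_pre, delta, threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(6, 4))            # stand-in for pre-trained weights
    d = 0.1 * rng.normal(size=(6, 4))      # stand-in for a task vector
    w_new = rectify(w, d, lam=1.0, threshold=0.05)
    print("norm of applied edit:", np.linalg.norm(w - w_new))
```

With `threshold=0.0` the filter keeps every component and the sketch reduces to plain task-vector negation, which makes the threshold a convenient knob for comparing the two behaviours.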

Paper Structure (22 sections, 10 equations, 4 figures, 13 tables)

This paper contains 22 sections, 10 equations, 4 figures, 13 tables.

Introduction
Preliminary
Parameter-Efficient Fine-Tuning
Task Arithmetic
Methodology
Empirical Evaluations
Setup
Toxicity Unlearning
Bias Unlearning
Memorization Unlearning
Discussion
Conclusion
Limitation
Related Work
Language Model Hallucinations
...and 7 more sections

Figures (4)

Figure 1: Overview of Ethos. Ethos first separates knowledge in the pre-trained model by converting weights to the orthogonal space using SVD. Then, Ethos projects the initial task vector, $\Delta \bm{\theta}_{\text{task}}$, onto the orthogonal space and identifies the components for general knowledge and the components for task-specific knowledge. Finally, Ethos creates a new task vector, $\Delta \tilde{\bm{\theta}}_{\text{task}}$, containing only the task-specific components.
Figure 2: Toxicity score and PPL versus $\lambda$ value for OPT-1.3B model. Our Ethos method shows better toxicity reduction while keeping the model's utility compared to baselines as $\lambda$ increases.
Figure 3: The distribution of values in $S_{\text{toxic}}$ in the 1st, 12th, and 24th query projection layers of the OPT-1.3B model. The majority of values are small, indicating marginal change along the corresponding components, while a few components receive substantial updates.
Figure 4: Fundamental capability evaluation for Alpaca-7B model. Our Ethos method shows performance comparable to the baselines.
