Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu
TL;DR
The paper tackles toxicity alignment in large language models by proposing ProFS, a tuning-free weight-editing method that identifies a toxic subspace through SVD on embedding differences and removes it via a projection. Grounded in factor analysis, ProFS centers the data, isolates toxic directions, and edits only the corresponding weight directions, achieving strong toxicity reduction with far less data than tuning-based methods and showing robustness to labeling noise. The authors establish theoretical links to Direct Preference Optimization (DPO), showing ProFS can be viewed as a denoised, low-data approximation to a single DPO step, while empirically demonstrating comparable or superior performance across multiple models and preference settings. The work offers a practical, transparent approach to safe deployment of LLMs and provides a principled bridge between editing and tuning paradigms for alignment.
Abstract
Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of large language models (LLMs) by training these models to match human behaviors exemplified by preference data. However, these methods are computationally intensive and lack controllability and transparency, which inhibits their widespread use. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness for the use case of toxicity reduction. Grounded in theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace. The toxic subspace is identified by extracting preference data embeddings from the language model and removing non-toxic information from these embeddings. We show that ProFS is more sample-efficient than DPO and more robust to noisy data. Finally, we connect tuning-based alignment with editing by establishing both theoretical and empirical links between ProFS and DPO, showing that ProFS can be interpreted as a denoised version of a single DPO step.
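The pipeline described above (extract paired embeddings, remove non-toxic information, find the toxic subspace by SVD, project it out of the weights) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `rank` parameter, the use of the normalized mean of the non-toxic embeddings as the "non-toxic information" to remove, and the weight layout `(d_model, d_out)` are all assumptions made for illustration.

```python
import numpy as np


def toxic_projection(emb_toxic, emb_nontoxic, rank=1):
    """Estimate a projector onto a 'toxic' subspace from paired embeddings.

    emb_toxic, emb_nontoxic: arrays of shape (n_pairs, d_model).
    rank: number of singular directions treated as toxic (hypothetical knob).
    """
    # Differences between paired toxic and non-toxic embeddings.
    diffs = emb_toxic - emb_nontoxic

    # Assumption: shared non-toxic context is approximated by the normalized
    # mean of the non-toxic embeddings and is projected out of the differences.
    mu = emb_nontoxic.mean(axis=0)
    mu = mu / np.linalg.norm(mu)
    centered = diffs - np.outer(diffs @ mu, mu)

    # The top right singular vectors span the estimated toxic subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]              # (rank, d_model), orthonormal rows
    return basis.T @ basis         # (d_model, d_model) orthogonal projector


def edit_weight(weight, projector):
    """Remove the toxic subspace from a weight of shape (d_model, d_out)."""
    # Equivalent to (I - P) @ W, without forming the identity matrix.
    return weight - projector @ weight
```

In this sketch, an edited matrix would be obtained as `edit_weight(W, toxic_projection(emb_toxic, emb_nontoxic, rank=k))`. No gradient steps are involved, which is the sense in which the approach is tuning-free; which layers to edit and how to choose `k` are left open here.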
