Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity

Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu

TL;DR

The paper tackles toxicity alignment in large language models by proposing ProFS, a tuning-free weight-editing method that identifies a toxic subspace through SVD on embedding differences and removes it via a projection. Grounded in factor analysis, ProFS centers the data, isolates toxic directions, and edits only the corresponding weight directions, achieving strong toxicity reduction with far less data than tuning-based methods and showing robustness to labeling noise. The authors establish theoretical links to Direct Preference Optimization (DPO), showing ProFS can be viewed as a denoised, low-data approximation to a single DPO step, while empirically demonstrating comparable or superior performance across multiple models and preference settings. The work offers a practical, transparent approach to safe deployment of LLMs and provides a principled bridge between editing and tuning paradigms for alignment.

Abstract

Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of large language models (LLMs) by training these models to match human behaviors exemplified by preference data. However, these methods are computationally intensive and lack controllability and transparency, which inhibits their widespread use. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness on the use case of toxicity reduction. Grounded in theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace. The toxic subspace is identified by extracting preference data embeddings from the language model and removing non-toxic information from these embeddings. We show that ProFS is more sample-efficient than DPO and more robust to noisy data. Finally, we attempt to connect tuning-based alignment with editing by establishing both theoretical and empirical connections between ProFS and DPO, showing that ProFS can be interpreted as a denoised version of a single DPO step.
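
The pipeline described above (embedding differences, removal of non-toxic information, SVD, projection) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `toxic_projector` and `edit_weight` are hypothetical, the corpus mean is approximated by the mean of all embeddings, the rank `k` is chosen by hand, and the weight matrix is assumed to have the embedding dimension along its rows.

```python
import numpy as np


def toxic_projector(x_toxic, x_nontoxic, k):
    """Return a (d, d) projector onto the top-k directions of the centered differences.

    x_toxic, x_nontoxic: (n, d) arrays of paired sentence embeddings.
    """
    # One difference row per preference pair: shape (n, d).
    diffs = x_toxic - x_nontoxic
    # Assumption: the corpus mean direction is approximated by the mean of
    # all embeddings; the paper may use a different centering choice.
    mu = np.concatenate([x_toxic, x_nontoxic]).mean(axis=0)
    mu /= np.linalg.norm(mu)
    # Remove the mean direction from every difference row.
    centered = diffs - np.outer(diffs @ mu, mu)
    # The top-k right singular vectors span the estimated toxic subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[:k].T  # shape (d, k), orthonormal columns
    return v @ v.T


def edit_weight(w, p_toxic):
    """Project the estimated toxic subspace out of a (d, m) weight matrix."""
    # Assumption: rows of w index the embedding dimension, so the
    # projector acts from the left.
    return w - p_toxic @ w


# Synthetic demo: a planted "toxic" direction e1 and a large mean along e0.
rng = np.random.default_rng(0)
n, d = 50, 32
e0, e1 = np.eye(d)[0], np.eye(d)[1]
x_non = 10.0 * e0 + 0.1 * rng.standard_normal((n, d))
x_tox = x_non + np.outer(rng.uniform(1.0, 2.0, n), e1) + 0.05 * rng.standard_normal((n, d))

p = toxic_projector(x_tox, x_non, k=1)
w_edited = edit_weight(rng.standard_normal((d, 8)), p)
print("weight of projector on planted direction:", p[1, 1])
```

In this sketch, swapping the two sentences of a pair only negates one row of `diffs` and leaves `mu` unchanged, so the returned projector is the same. That gives one intuition for robustness to flipped labels under these assumptions; the paper's own centering choice and analysis may differ.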

Paper Structure (62 sections, 25 equations, 10 figures, 24 tables, 1 algorithm)

Figures (10)

  • Figure 1: Left: Structure of embedding vectors. We posit that a set of singular vectors defines the toxic subspace, which is separate from desired model capabilities (the context subspace and corpus mean direction). Right: The ProFS method. We edit the weights of MLP-Value layers by identifying a projection filter that represents the toxic subspace. The edit is performed once, after which the model functions as a drop-in replacement with no architectural modifications.
  • Figure 2: Sample complexity of ProFS and DPO, on GPT-2. ProFS obtains significant toxicity reduction with as few as 50 datapoints, preserving model capability (Table \ref{tab:sample-complexity-gpt2}). In comparison, DPO requires more data to achieve similar results.
  • Figure 3: Robustness to label noise, using $N=500$ on GPT-2. ProFS results are marked in blue and DPO results in red. Unlike DPO, ProFS is not impacted by flipping the labels of preference data.
  • Figure 4: Impact of layer selection on edit performance. Prior studies have shown complex concepts like toxicity to be encoded in higher layers of a model, while lower layers process more basic syntactic and semantic information. Editing the higher layers results in effective toxicity reduction, while preserving perplexity.
  • Figure 5: Ratio of DPO gradients explained by toxic subspace: $\| \mathbf{P}^{\mathrm{toxic}} {\bm{G}} \|_F / \| {\bm{G}} \|_F$. The first-step DPO gradients with respect to the MLP-value matrix at each layer are computed using $\{8,32,128\}$ samples. For comparison, we report a baseline in which ${\bm{G}}$ is replaced by a random matrix whose entries are independent normal random variables.
  • ...and 5 more figures
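
The ratio in the Figure 5 caption, $\| \mathbf{P}^{\mathrm{toxic}} {\bm{G}} \|_F / \| {\bm{G}} \|_F$, and its random-matrix baseline can be illustrated with synthetic data. The sketch below is an assumption-laden toy, not a reproduction of the figure: the helper names `explained_ratio` and `random_projector` are hypothetical, the "gradient" is a synthetic matrix rather than a DPO gradient, and the projector is assumed to act on the rows of the matrix. For a rank-$k$ projector in dimension $d$ and a matrix with independent standard normal entries, the ratio concentrates near $\sqrt{k/d}$, which is what makes a much larger value for real gradients informative.

```python
import numpy as np


def explained_ratio(p, g):
    """Frobenius-norm ratio ||P G||_F / ||G||_F for a (d, d) projector and (d, m) matrix."""
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(p @ g) / np.linalg.norm(g)


def random_projector(d, k, rng):
    """Return an orthonormal basis (d, k) and the projector onto its span."""
    basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return basis, basis @ basis.T


rng = np.random.default_rng(0)
d, m, k = 256, 256, 4
basis, p = random_projector(d, k, rng)

# Synthetic stand-in for a gradient that lies mostly in the subspace.
g_aligned = basis @ rng.standard_normal((k, m)) + 0.1 * rng.standard_normal((d, m))
# Baseline: independent standard normal entries, no preferred direction.
g_random = rng.standard_normal((d, m))

print("aligned matrix ratio:", explained_ratio(p, g_aligned))
print("random baseline ratio:", explained_ratio(p, g_random))
print("sqrt(k / d):", np.sqrt(k / d))
```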