Tracing and Reversing Rank-One Model Edits

Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer

TL;DR

This work investigates the traceability and reversibility of Rank-One Model Editing (ROME) in large language models. By analyzing the rank-one weight update $W_N = u v^T$ that ROME adds to the MLP projection matrix $W_V$, the authors show that edits leave distinctive patterns that reveal which weights were edited, enable prediction of the edited relation, and allow the edited object to be inferred with high accuracy without access to the editing prompt. They introduce bottom-rank SVD approximations to reverse edits, recovering the original outputs with substantial accuracy across GPT2-XL, GPT-J, and LLAMA3. The study demonstrates practical ways to detect, localize, and reverse malicious edits, and offers a framework that can be adapted to new editing scenarios. Overall, it provides a weight-centric methodology for defending against adversarial knowledge edits by tracing, interpreting, and neutralizing them at the parameter level.

Abstract

Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model's original outputs with $\geq$ 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
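
The reversal idea can be illustrated on a toy matrix. The following is a minimal sketch, not the authors' implementation: it assumes that a bottom-rank approximation means rebuilding the edited matrix from its SVD while dropping the $k$ largest singular components, and it uses a small random matrix as a stand-in for $W_V$. The function name `bottom_rank_approx`, the dimensions, and the scales are hypothetical choices for illustration.

```python
import numpy as np

def bottom_rank_approx(W, k=1):
    """Rebuild W from its SVD while dropping the k largest singular components.

    Assumption for this sketch: a large rank-one edit dominates the top
    singular component(s), so removing them approximately undoes the edit.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, k:] * S[k:]) @ Vt[k:, :]

rng = np.random.default_rng(0)
d_out, d_in = 64, 32                             # toy sizes, not real model dimensions

W = rng.normal(scale=0.02, size=(d_out, d_in))   # stand-in for the original W_V
u = rng.normal(size=(d_out, 1))
v = rng.normal(size=(d_in, 1))
W_N = u @ v.T                                    # rank-one update W_N = u v^T
W_edited = W + W_N                               # edited matrix

W_rec = bottom_rank_approx(W_edited, k=1)

# Frobenius distance to the original weights, before and after the reversal.
err_edited = np.linalg.norm(W_edited - W)
err_rec = np.linalg.norm(W_rec - W)
print(f"distance to original: edited={err_edited:.3f}, recovered={err_rec:.3f}")
```

In this toy setting the update is much larger than the original weights, so removing a single singular component brings the matrix far closer to the original. On real models the appropriate $k$ and the achievable recovery depend on the model and the edit, which is what the reported accuracies measure.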

Paper Structure

This paper contains 33 sections, 8 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: We investigate several countermeasures to malicious knowledge editing with ROME (Meng et al., 2022). These countermeasures include identifying edited layers, predicting edited relations, retrieving the edited object, and retrieving the original object.
  • Figure 2: Percentage of row vectors in the update matrix $W_N$ that point in the same (blue, circled pattern) or the opposite (orange, cross pattern) direction, shown with standard deviation. More than 80% of the vectors have the same direction in the GPT models.
  • Figure 3: Intuition for the increased $pcs$ score after editing. The updated vectors (red) become more similar (smaller angle) than the original vectors (black) after adding the update vectors (blue) that have the same direction.
  • Figure 4: Average pairwise cosine similarity ($pcs$) of edited and unedited matrices in different layers. We show the values with standard deviation in Tab. \ref{tab:pcs:counterfact} in the appendix.
  • Figure 5: Approach for inducing the edited object from the edited model. Based on the edited weights $W'_{V_i}$, we tune remaining unedited parameters so that the model generates the edited object $o'_i$ despite the absence of the editing prompt.
  • ...and 10 more figures
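
The intuition behind Figures 2 to 4 can be reproduced on synthetic data. The sketch below is illustrative only: it assumes $pcs$ is the mean cosine similarity over all pairs of distinct row vectors of a matrix, and it builds a toy rank-one update whose rows all point in the same direction (the real updates analyzed in the paper only do so for most rows). The function name `pcs`, the matrix sizes, and the random data are hypothetical.

```python
import numpy as np

def pcs(W):
    """Average pairwise cosine similarity between distinct rows of W."""
    X = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                    # all row-pair cosines
    n = W.shape[0]
    # Exclude the diagonal (each row compared with itself).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))           # stand-in for an unedited matrix

# Toy rank-one update u v^T: row i equals u_i * v^T, so rows with u_i > 0
# all point in the direction of v. Here every u_i is positive by construction.
u = np.abs(rng.normal(size=(64, 1)))
v = rng.normal(size=(32, 1))
W_N = u @ v.T

# Fraction of update rows pointing in the same direction as v.
frac_same = float(np.mean((W_N @ v).ravel() > 0))

pcs_before = pcs(W)
pcs_after = pcs(W + W_N)
print(f"same-direction rows: {frac_same:.2f}")
print(f"pcs before edit: {pcs_before:.3f}, after edit: {pcs_after:.3f}")
```

Adding update rows that share a direction pulls the rows of the matrix toward each other, which raises $pcs$. This is the effect that makes an edited layer stand out when $pcs$ is compared across layers.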