Reverse-Engineering Model Editing on Language Models

Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He

TL;DR

The paper shows that the widely used locate-then-edit paradigm for updating language models leaks information about edited data through the parameter update $\Delta \mathbf{W}$, turning a safety mechanism into a potential data exfiltration channel. It introduces KSTER, a two-stage white-box attack that first recovers edited subjects via spectral analysis of $\Delta \mathbf{W} \mathbf{C}$ and then reconstructs the semantic prompts through entropy-based evaluation, with strong empirical results across multiple models and editing methods. To counter this risk, the authors propose Subspace Camouflage, a defense that injects semantic decoys into the update subspace to mask the fingerprint while preserving editing efficacy; they also provide theoretical guarantees and show a favorable protection-utility trade-off in experiments. The work underscores privacy and security implications of model editing, proposing both auditing mechanisms and secure editing strategies as essential directions for future research.

Abstract

Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpaceReconsTruction-then-EntropyReduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
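The first stage of the attack (subject recovery from the row space of the update) can be illustrated on synthetic data. The sketch below is not the authors' implementation. It assumes a MEMIT-style closed-form update $\Delta \mathbf{W} = \mathbf{R}\mathbf{K}^\top(\mathbf{C} + \mathbf{K}\mathbf{K}^\top)^{-1}$, which this summary does not state, and every name in it (`K`, `R`, `C`, `fingerprint_score`) is hypothetical. Under that assumption, the Woodbury identity gives $\Delta \mathbf{W}\mathbf{C} = \mathbf{R}(\mathbf{I} + \mathbf{K}^\top\mathbf{C}^{-1}\mathbf{K})^{-1}\mathbf{K}^\top$, so the row space of $\Delta \mathbf{W}\mathbf{C}$ is spanned by the edited keys and a candidate key can be scored by how much of it lies in that space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_edits = 64, 48, 3

# Synthetic stand-ins (assumptions, not real model data):
# C: covariance of pre-existing keys, K: keys of the edited subjects,
# R: residuals that the edit writes into the layer output.
X = rng.standard_normal((d_in, 500))
C = X @ X.T / 500
K = rng.standard_normal((d_in, n_edits))
R = rng.standard_normal((d_out, n_edits))

# Assumed MEMIT-style closed form: dW = R K^T (C + K K^T)^{-1}
dW = R @ K.T @ np.linalg.inv(C + K @ K.T)

# Attacker side: only dW and (an estimate of) C are used from here on.
M = dW @ C
_, s, Vt = np.linalg.svd(M, full_matrices=False)
rank = int(np.sum(s > 1e-8 * s[0]))  # numerical rank of dW C
basis = Vt[:rank]                    # orthonormal basis of its row space


def fingerprint_score(key):
    """Fraction of the key's norm that lies in the row space of dW C."""
    proj = basis.T @ (basis @ key)
    return float(np.linalg.norm(proj) / np.linalg.norm(key))


edited_scores = [fingerprint_score(K[:, i]) for i in range(n_edits)]
decoy_scores = [fingerprint_score(rng.standard_normal(d_in)) for _ in range(5)]
print("edited keys:", np.round(edited_scores, 4))
print("random keys:", np.round(decoy_scores, 4))
```

In this toy setting the edited keys score essentially 1 while unrelated random keys score far lower, which is the sense in which the update carries a fingerprint. In a real attack the candidate keys would come from running candidate subjects through the model, and the covariance would have to be estimated.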

Paper Structure

This paper contains 38 sections, 13 theorems, 63 equations, 9 figures, 10 tables, 3 algorithms.

Key Result

Lemma 5.9

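For an invertible matrix $\mathbf{A}\in\mathbb{R}^{d\times d}$, for all $\mathbf{U}\in\mathbb{R}^{d\times r}$ and $\mathbf{V}\in\mathbb{R}^{r\times d}$ such that $\mathbf{I}_r + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}$ is invertible, the following equality (the Woodbury matrix identity, stated here in its standard form) holds:

$$(\mathbf{A} + \mathbf{U}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\left(\mathbf{I}_r + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1}$$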
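A quick numerical check can make the lemma concrete. The sketch below compares both sides of the standard Woodbury matrix identity on small random matrices. The dimensions, the seed, and the choice of a symmetric positive definite $\mathbf{A}$ with $\mathbf{V} = \mathbf{U}^\top$ are assumptions made only so that both invertibility conditions hold by construction; the lemma itself covers general $\mathbf{U}$ and $\mathbf{V}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2

# A = B B^T + I is symmetric positive definite, hence invertible.
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)
U = rng.standard_normal((d, r))
V = U.T  # with this choice, I_r + V A^{-1} U is also positive definite

A_inv = np.linalg.inv(A)
cap = np.eye(r) + V @ A_inv @ U  # the small r x r matrix from the lemma

# Left side: direct d x d inverse. Right side: Woodbury form,
# which only needs A^{-1} and an r x r inverse.
lhs = np.linalg.inv(A + U @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(cap) @ V @ A_inv

max_err = float(np.max(np.abs(lhs - rhs)))
print("max abs difference:", max_err)
```

The practical appeal is that when $r \ll d$ and $\mathbf{A}^{-1}$ is already available, a low-rank correction to the inverse costs only an $r \times r$ inversion.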

Figures (9)

  • Figure 1: Overview of our proposed two-stage reverse-engineering attack framework KSTER in a white-box setting, along with an associated defense strategy subspace camouflage.
  • Figure 2: Subject invariance for MEMIT (Llama3-8B-Instruct). (a) Attention consistently focuses ($>$50%) on subject tokens across different prompt templates $\{r_j\}$ for each subject $\{s_i\}$. (b) Different prompt templates filled with the same subject exhibit high similarity in their hidden states at the shallowest edited layer.
  • Figure 3: Performance comparison on Llama3-8B-Instruct under different covariance estimations.
  • Figure 4: Distribution of true prompt ranks for Llama3-8B-Instruct on the CounterFact dataset. The x-axis labels denote the upper bound of each rank interval (e.g., 50 represents the range $(20,50]$).
  • Figure 5: Protection-utility trade-off for Llama3-8B on CounterFact ($N=100$). Shaded regions indicate standard deviation over 5 runs.
  • ...and 4 more figures

Theorems & Definitions (33)

  • Remark 5.3
  • Definition 5.4
  • Remark 5.7
  • Lemma 5.9: Woodbury matrix identity
  • Lemma 5.10
  • Remark 5.11
  • proof
  • Theorem 5.12
  • proof
  • Theorem 5.13
  • ...and 23 more