Reverse-Engineering Model Editing on Language Models
Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He
TL;DR
The paper shows that the widely used locate-then-edit paradigm for updating language models leaks information about edited data through the parameter update $\Delta \mathbf{W}$, turning a safety mechanism into a potential data exfiltration channel. It introduces KSTER, a two-stage white-box attack that first recovers edited subjects via spectral analysis of $\Delta \mathbf{W} \mathbf{C}$ and then reconstructs the semantic prompts through entropy-based evaluation, with strong empirical results across multiple models and editing methods. To counter this risk, the authors propose Subspace Camouflage, a defense that injects semantic decoys into the update subspace to mask the fingerprint while preserving editing efficacy; they also provide theoretical guarantees and show a favorable protection-utility trade-off in experiments. The work underscores privacy and security implications of model editing, proposing both auditing mechanisms and secure editing strategies as essential directions for future research.
Abstract
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpace ReconsTruction-then-Entropy Reduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
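To make the subject-recovery idea concrete, the following is a minimal synthetic sketch, not the authors' implementation. It assumes a MEMIT-style closed-form update $\Delta \mathbf{W} = \mathbf{R}\mathbf{K}^\top(\mathbf{C} + \mathbf{K}\mathbf{K}^\top)^{-1}$, where the columns of $\mathbf{K}$ are the key vectors of the edited subjects, $\mathbf{R}$ is a residual matrix, and $\mathbf{C}$ is a symmetric key covariance known to the attacker. This form is an assumption taken from the general locate-then-edit literature; the text above only states that the row space of the update carries the fingerprint. Under that assumption, the Woodbury identity implies that the row space of $\Delta \mathbf{W}\mathbf{C}$ equals the span of $\mathbf{K}$, so an SVD exposes the edited keys. All function names, dimensions, and the random "candidate keys" are hypothetical; in a real attack the candidates would be computed by running the model on candidate subjects.

```python
import numpy as np


def memit_style_update(K, R, C):
    """Assumed closed-form edit: dW = R K^T (C + K K^T)^{-1} (illustrative only)."""
    return R @ K.T @ np.linalg.inv(C + K @ K.T)


def recover_key_subspace(delta_w, C, rank):
    """Return the top `rank` right-singular vectors of dW @ C.

    Under the assumed update form these rows are an orthonormal basis
    of the span of the edited keys.
    """
    _, _, vt = np.linalg.svd(delta_w @ C, full_matrices=False)
    return vt[:rank]  # shape: (rank, d_in)


def subspace_score(basis, candidates):
    """Fraction of each candidate key's norm that lies in the recovered subspace."""
    proj = candidates @ basis.T  # shape: (n_candidates, rank)
    return np.linalg.norm(proj, axis=1) / np.linalg.norm(candidates, axis=1)


rng = np.random.default_rng(0)
d_in, d_out, n_edit, n_cand = 64, 48, 3, 200

# Synthetic symmetric positive-definite stand-in for the key covariance C.
A = rng.standard_normal((d_in, 4 * d_in))
C = A @ A.T / (4 * d_in)

# Hypothetical key vectors for a pool of candidate subjects (random here).
candidates = rng.standard_normal((n_cand, d_in))
edited_idx = np.array([5, 17, 123])   # subjects that were "edited"
K = candidates[edited_idx].T          # shape: (d_in, n_edit)
R = rng.standard_normal((d_out, n_edit))

delta_w = memit_style_update(K, R, C)
basis = recover_key_subspace(delta_w, C, rank=n_edit)
scores = subspace_score(basis, candidates)

# Candidates whose keys lie almost entirely in the recovered subspace.
top = np.sort(np.argsort(scores)[-n_edit:])
print(top.tolist())
```

In this toy setting the edited candidates score essentially 1 while unrelated random keys score far lower, so ranking by `subspace_score` singles them out. The number of edits is passed in as `rank` for simplicity; in practice it would have to be estimated, for example from the decay of the singular values.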
