Resolving Lexical Bias in Model Editing
Hammad Rizwan, Domenic Rosati, Ga Wu, Hassan Sajjad
TL;DR
This work addresses lexical bias in adapter-based model editing by introducing PENME, a projection-based framework that disentangles lexical and semantic representations. PENME pairs a projection network with a key-value codebook to localize edits, achieving high Edit Success and Locality while preserving Generalization across paraphrases and unseen prompts. The approach achieves state-of-the-art performance on Counterfact and zsRE, scales linearly with the number of edits, and maintains downstream task performance, which makes it suited to safe, targeted model updates. It offers a modular, efficient alternative to weight modification; cross-lingual transfer and incremental training are left as future directions.
Abstract
Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
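The edit-localization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear `project` function stands in for the learned projection network, and the per-entry distance threshold, the Euclidean metric, and all names (`Codebook`, `add`, `lookup`) are assumptions made for illustration. A prompt's representation is projected, compared with the stored keys, and the stored edit is returned only if the nearest key lies within its threshold; otherwise the base model's output would be left unchanged.

```python
import math


def project(x, weights):
    """Hypothetical stand-in for a learned projection network: a plain linear map."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class Codebook:
    """Key-value store: keys are projected prompt vectors, values are edited outputs."""

    def __init__(self):
        self.entries = []  # list of (key_vector, threshold, value)

    def add(self, key, threshold, value):
        self.entries.append((key, threshold, value))

    def lookup(self, query):
        """Return the value of the nearest key if it is within that key's threshold."""
        best_entry, best_dist = None, math.inf
        for key, threshold, value in self.entries:
            dist = math.dist(query, key)  # Euclidean distance (assumed metric)
            if dist < best_dist:
                best_entry, best_dist = (threshold, value), dist
        if best_entry is not None and best_dist <= best_entry[0]:
            return best_entry[1]
        return None  # no edit triggered; the unedited model output would be used


# Toy example with an identity projection and a single stored edit.
W = [[1.0, 0.0], [0.0, 1.0]]
codebook = Codebook()
codebook.add(key=project([1.0, 0.0], W), threshold=0.5, value="edited answer")

paraphrase = project([1.1, 0.1], W)   # close to the stored key
unrelated = project([0.0, 1.0], W)    # far from the stored key

print(codebook.lookup(paraphrase))
print(codebook.lookup(unrelated))
```

In this toy setup the paraphrase falls inside the threshold and retrieves the edit, while the unrelated prompt does not. In the setting the abstract describes, the projection would be trained so that paraphrases land near their edit's key and lexically similar but irrelevant prompts land far from it.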
