Resolving Lexical Bias in Model Editing

Hammad Rizwan, Domenic Rosati, Ga Wu, Hassan Sajjad

TL;DR

This work tackles lexical bias in adapter-based model editing by introducing PENME, a projection-based framework that disentangles lexical and semantic representations. PENME combines a projection network with a key-value codebook to localize edits, enabling high Edit Success and Locality while preserving Generalization across paraphrases and unseen prompts. The approach demonstrates state-of-the-art performance on Counterfact and zsRE, scales linearly with edits, and maintains downstream task performance, highlighting practical impact for safe, targeted model updates. The method offers a modular, efficient alternative to weight modification, with potential for cross-lingual transfer and incremental training as future directions.

Abstract

Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
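
The scoping idea described in the abstract, where an edit fires only when an input lands close enough to a stored edit in the learned representation space, can be illustrated with a small sketch. This is not the authors' implementation. It assumes Euclidean distance, a per-entry distance threshold, and inputs that have already been passed through the projection network. The names `CodebookEntry` and `lookup_edit` are hypothetical, and the stored value is a plain string standing in for whatever edit payload the real system keeps.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence

Vector = Sequence[float]


@dataclass
class CodebookEntry:
    key: Vector       # projected representation of the edit prompt (assumed given)
    value: str        # stand-in for the stored edit payload
    threshold: float  # assumed per-entry maximum distance at which the edit fires


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lookup_edit(projected: Vector, codebook: List[CodebookEntry]) -> Optional[str]:
    """Return the value of the nearest key if it lies within that key's
    threshold; otherwise return None so the input passes through unedited."""
    if not codebook:
        return None
    nearest = min(codebook, key=lambda entry: euclidean(projected, entry.key))
    if euclidean(projected, nearest.key) <= nearest.threshold:
        return nearest.value
    return None


# Toy codebook with two edits in a 2-dimensional projected space.
codebook = [
    CodebookEntry(key=[0.0, 0.0], value="Rome", threshold=0.5),
    CodebookEntry(key=[3.0, 4.0], value="Paris", threshold=0.5),
]

near_first_edit = lookup_edit([0.1, 0.2], codebook)  # inside the first entry's radius
unrelated_input = lookup_edit([1.0, 1.0], codebook)  # outside every radius
print(near_first_edit, unrelated_input)
```

In this toy setup the first query falls inside the radius of the first key and retrieves its value, while the second is too far from every key and is left alone. The quality of the lookup depends entirely on the projected space keeping paraphrases near their edit and lexically similar but irrelevant prompts far away, which is what the learned projection is meant to provide.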

Paper Structure

This paper contains 38 sections, 3 equations, 11 figures, 11 tables, 2 algorithms.

Figures (11)

  • Figure 1: Projector networks mitigate lexical bias: a critical problem in adapter-based model editing techniques. Percentage of samples where irrelevant but lexically similar prompts are closer than semantically similar paraphrases in the representation space before and after our learned projection (PENME).
  • Figure 2: An illustration of lexical dominance in embeddings: a) a low similarity threshold (illustrated with the circle) results in failing to edit paraphrases; b) a high similarity threshold results in misfires with irrelevant prompts; c) our solution, which disentangles the representation space.
  • Figure 3: PENME uses a projection network that interfaces with the pointwise feed-forward layer output in a selected transformer block. This projection network, coupled with key-value codebook storage, acts as a scoping mechanism by comparing projection outputs with codebook entries. This mechanism determines whether the current input relates to a specific edit or should pass through the model unmodified.
  • Figure 4: Percentage of samples where edits are closer to lexically similar yet irrelevant prompts than to paraphrases in the representation space of different models across various layers. T5-small, Llama-2-7b, and GPT2-XL have 6, 32, and 48 layers, respectively. The full figure for all layers can be found in the appendix (sec:lexical_dominance_details).
  • Figure 5: Trade-off between generalization and locality performance across different hyperparameter settings. The distance threshold $\tau$ varies from $0.01$ to $0.2$ in increments of $0.01$ (values of $\tau$ are normalized by 100), while the edit-pairing similarity threshold $\phi$ ranges from $0.5$ to $0.9$ in increments of $0.1$. Higher $\phi$ values enforce stricter edit similarity requirements. The results show how hyperparameter tuning affects the projector network's learning capacity and overall performance.
  • ...and 6 more figures
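
The quantity plotted in Figures 1 and 4, the percentage of samples where a lexically similar but irrelevant prompt sits closer to the edit than a true paraphrase does, can be computed along the following lines. This is a minimal sketch under stated assumptions: Euclidean distance, exactly one paraphrase and one irrelevant prompt per edit, and representations already extracted as plain vectors. The function name `lexical_dominance_rate` is hypothetical and the vectors are toy values, not data from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Each triple holds (edit, paraphrase, lexically similar irrelevant prompt).
Triple = Tuple[Vector, Vector, Vector]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lexical_dominance_rate(triples: List[Triple]) -> float:
    """Percentage of triples in which the irrelevant prompt is strictly
    closer to the edit than the paraphrase is."""
    if not triples:
        raise ValueError("need at least one triple")
    dominated = sum(
        1
        for edit, paraphrase, irrelevant in triples
        if euclidean(edit, irrelevant) < euclidean(edit, paraphrase)
    )
    return 100.0 * dominated / len(triples)


# Toy data: in the first triple the irrelevant prompt is closer (0.5 < 1.0),
# in the second the paraphrase is closer (1.0 < 2.0).
triples = [
    ([0.0, 0.0], [1.0, 0.0], [0.5, 0.0]),
    ([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]),
]
rate = lexical_dominance_rate(triples)
print(rate)
```

Running the same function on representations before and after a projection would give the kind of before/after comparison that Figure 1 reports, with a lower percentage indicating less lexical dominance.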