Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

TL;DR

This work hypothesizes that fixed golden layers exist whose editing performance is close to that of sample-wise optimal layers. It proposes Layer Gradient Analysis (LGA), a method that estimates golden layers efficiently via gradient attribution, avoiding extensive trial and error across multiple editing runs.

Abstract

Knowledge editing in Large Language Models (LLMs) aims to update the model's prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, so a fixed editing layer yields different editing performance from sample to sample. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance, close to that of sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, Layer Gradient Analysis (LGA), that estimates golden layers efficiently via gradient attribution, avoiding extensive trial and error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.
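To make the golden-layer hypothesis concrete, the sketch below contrasts a single fixed layer chosen on a proxy set with the sample-wise optimal (oracle) choice on a test set. It is only an illustration of the brute-force comparison the abstract describes, not the LGA method itself. The function names (`golden_layer`, `oracle_score`, `fixed_layer_score`) and the toy score tables are hypothetical; in practice each entry would come from an actual editing run at that layer.

```python
from statistics import mean

def golden_layer(scores):
    """Pick the one layer with the highest mean edit score over all samples.

    scores[i][l] is the edit score of sample i when layer l is edited.
    """
    num_layers = len(scores[0])
    layer_means = [mean(row[l] for row in scores) for l in range(num_layers)]
    return max(range(num_layers), key=layer_means.__getitem__)

def oracle_score(scores):
    """Mean score if every sample could use its own best layer."""
    return mean(max(row) for row in scores)

def fixed_layer_score(scores, layer):
    """Mean score when the same layer is edited for every sample."""
    return mean(row[layer] for row in scores)

# Hypothetical toy data: 3 proxy samples and 2 test samples, 3 layers each.
proxy_scores = [
    [0.2, 0.9, 0.6],
    [0.1, 0.8, 0.9],
    [0.3, 1.0, 0.5],
]
test_scores = [
    [0.0, 0.9, 0.7],
    [0.2, 0.7, 0.8],
]

# Select the layer on the proxy set, then measure the gap to the oracle on the test set.
layer = golden_layer(proxy_scores)
gap = oracle_score(test_scores) - fixed_layer_score(test_scores, layer)
print(f"golden layer: {layer}, gap to sample-wise optimum: {gap:.3f}")
```

A small gap on held-out queries is what the hypothesis predicts. Filling in the score table this way requires one editing run per sample and layer, which is the trial-and-error cost that LGA is proposed to avoid.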

Paper Structure (21 sections, 1 equation, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Performance of golden layers selected via the proxy and test sets with GPT-2 XL on (A) ZSRE, (B) WikiCounterfact, and (C) WikiRecent. Editing performance evaluation is conducted on the test set queries. Error bars denote the standard error of the mean.
  • Figure 2: Visualization of model layers for GPT-2 XL, LLaMA2-7B, and Gemma3-12B, where each cell indicates the data-specific performance of a layer, measured as the absolute deviation in Rewrite Accuracy from the optimal layer on the test set. Darker blue cells indicate better-performing layers, with markers denoting optimal layers (several layers can tie as optimal on the same dataset). The golden cells across layers denote the golden layers selected via the proxy set of each dataset. This union of proxy-set golden layers generally selects the higher-performing layers and, most often, the optimal layers themselves.
  • Figure 3: Performance comparison between LGA and CMA across different LLMs and the (A) ZSRE, (B) WikiBio, (C) WikiCounterfact, and (D) WikiRecent datasets. Overall, LGA outperforms CMA and attains improvements simply via improved layer selection.
  • Figure 4: Runtime of LGA and CMA relative to layer-wise Brute-Force (BF) golden layer search for editing via R-ROME on GPT-2 XL. The five datasets (ZSRE, WikiBio, WikiCounterfact, WikiRecent, and Counterfact) are categorized by average query token length (left) and proxy size (right). LGA attains significant speedups over both CMA and BF.
  • Figure 5: Performance of golden layers selected via the proxy and test sets with GPT-2 XL on (A) WikiBio and (B) Counterfact. Editing performance evaluation is conducted on the test set queries.
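
The per-cell quantity described in the Figure 2 caption, the absolute deviation in Rewrite Accuracy from the optimal layer, can be sketched as follows. The helper names and the accuracy values are hypothetical and chosen only to show how ties among optimal layers arise.

```python
def deviation_from_optimal(layer_accuracy):
    """Absolute deviation of each layer's accuracy from the best layer's accuracy."""
    best = max(layer_accuracy)
    return [abs(best - acc) for acc in layer_accuracy]

def optimal_layers(layer_accuracy, tol=1e-9):
    """Indices of all layers tied for the best accuracy (deviation within tol)."""
    deviations = deviation_from_optimal(layer_accuracy)
    return [i for i, d in enumerate(deviations) if d <= tol]

# Hypothetical Rewrite Accuracy per layer on one dataset.
rewrite_accuracy = [0.5, 0.75, 1.0, 1.0]

print(deviation_from_optimal(rewrite_accuracy))  # smaller means closer to optimal
print(optimal_layers(rewrite_accuracy))          # more than one index when layers tie
```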