CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing

Manit Baser, Alperen Yildiz, Dinil Mon Divakaran, Mohan Gurusamy

Abstract

The static knowledge representations of large language models (LLMs) inevitably become outdated or incorrect over time. While model-editing techniques offer a promising solution by modifying a model's factual associations, they often produce unpredictable ripple effects: unintended behavioral changes that propagate even to the hidden space. In this work, we introduce CLaRE, a lightweight representation-level technique to identify where these ripple effects may occur. Unlike prior gradient-based methods, CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. To enable systematic study, we prepare and analyse a corpus of 11,427 facts drawn from three existing datasets. Using CLaRE, we compute large-scale entanglement graphs of this corpus for multiple models, capturing how local edits propagate through representational space. These graphs enable stronger preservation sets for model editing, audit trails, efficient red-teaming, and scalable post-edit evaluation. Compared with baselines, CLaRE achieves an average 62.2% improvement in Spearman correlation with ripple effects while being $2.74\times$ faster and using $2.85\times$ less peak GPU memory. In addition, CLaRE requires only a fraction of the storage needed by the baselines to compute and preserve fact representations. Our entanglement graphs and corpus are available at https://anonymous.4open.science/r/CLaRE-488E.

Paper Structure (31 sections, 9 equations, 38 figures, 11 tables)

Figures (38)

  • Figure 1: A targeted update to a political fact may inadvertently alter the model's prediction for an unrelated musical fact, despite no semantic connection. This demonstrates how edits can trigger ripple effects far beyond the intended factual neighborhood.
  • Figure 2: For each fact, GradSim computes the entire gradient, while CLaRE uses a single forward pass up to the last critical layer, enabling faster and more scalable entanglement mapping.
  • Figure 3: Correlation patterns for AlphaEdit: entanglement vs. $\ell_2$ logit shift (left) and $|\Delta \log P(y)|$ (right).
  • Figure 4: Performance comparison between CLaRE and GradSim in terms of Spearman correlation ($\rho_s$). The left panel shows $\rho_s$ between entanglement values and $\ell_2$ logit shift, and the right panel shows $\rho_s$ between entanglement values and $|\Delta \log P(y)|$. CLaRE (wider, transparent bars) consistently achieves higher $\rho_s$ than GradSim (narrower, solid bars).
  • Figure 5: Computational efficiency comparison. Points closer to the center indicate better performance.
  • ...and 33 more figures
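
The evaluation metric named in the captions above, the Spearman correlation $\rho_s$ between entanglement values and a ripple-effect measure such as the $\ell_2$ logit shift, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the numbers in the usage lines are hypothetical, chosen only to show how a rank correlation between two score lists is computed (ranks with ties averaged, then a Pearson correlation on the ranks).

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(vx * vy)


# Hypothetical, made-up values for illustration only (not results from the paper):
# one entanglement score and one observed l2 logit shift per (edited, probed) fact pair.
entanglement = [0.91, 0.12, 0.55, 0.30]
logit_shift = [2.4, 0.3, 0.9, 1.1]
rho = spearman(entanglement, logit_shift)
```

A higher $\rho_s$ means that ranking fact pairs by entanglement more closely matches ranking them by observed ripple magnitude, which is the sense in which the captions compare CLaRE with GradSim.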