ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs

Manit Baser, Dinil Mon Divakaran, Mohan Gurusamy

TL;DR

ThinkEval addresses indirect knowledge leakage in LLM editing by building knowledge graphs derived from chain-of-thought (CoT) reasoning and using them to analyze how edits propagate through causal chains. It introduces deep editing and the IFR metric, along with the KnowGIC benchmark of 1,406 multi-step chains, to systematically evaluate editing techniques across multiple models. The study finds that even state-of-the-art methods that are effective on direct edits still show significant leakage and ripple effects, underscoring the need for holistic editing approaches and sequential testing. By providing a scalable framework and benchmark, ThinkEval offers practical guidance for safer, more reliable model editing in high-stakes domains, with potential extensions to non-factual and procedural knowledge.

Abstract

Robust model-editing techniques are essential for deploying large language models (LLMs) in practical applications, as they offer cost-effective ways to deal with challenges such as privacy breaches, bias, and the spread of misinformation. For example, an LLM-based healthcare assistant may need to update outdated or incorrect knowledge to prevent harmful recommendations. However, many editing techniques focus on isolated facts and therefore fail to prevent indirect knowledge leakage: the unintended reconstruction of edited-out information through persistent causal links and contextual relationships. To assist users in selecting the right editing technique, we develop and present ThinkEval, a framework to systematically quantify indirect knowledge leakage and ripple effects in model editing. ThinkEval builds and employs specialized knowledge graphs to analyze the causal structure of facts before and after editing. To support this approach, we present KnowGIC, a benchmark dataset comprising multi-step reasoning paths that precisely measure these complex knowledge transformation effects. We evaluate five editing techniques (AlphaEdit, RECT, ROME, MEMIT, and PRUNE) across multiple LLMs. Our results show that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge, compromising the contextual integrity of a model's knowledge. Our dataset is available at: https://github.com/manitbaser/KnowGIC.

Paper Structure

This paper contains 41 sections, 12 equations, 22 figures, 16 tables.

Figures (22)

  • Figure 1: Example of extracting the original fact post-edit. Prompting with a direct query may fail, but a 3-step sequential inference may extract the original fact.
  • Figure 2: ThinkEval framework. Initially, (1) a base triplet and (2) an LLM are utilized to (3) generate a tailored dataset reflective of the LLM's internal knowledge structure. Next, (4) the LLM is edited using an editing technique. Then, (5) the edited model is evaluated over the constructed dataset, yielding insights into the effectiveness of the editing process from a deep-editing perspective, (6) measured via IFR and Preservation.
  • Figure 3: Samples of $n$-step chains from the Harry Potter case study, leading to original fact leakage even after editing. The number below each link represents the ratio of responses (out of five generations) that retain the original output, quantifying the extent to which the LLM reveals the initial fact via indirect reasoning.
  • Figure 4: IFR for $n$-step samples. A lower IFR implies lower deducibility of the supposedly-edited fact.
  • Figure 5: Preservation for different model-editing techniques. A higher Preservation indicates stronger retention of broader context integrity.
  • ...and 17 more figures
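
The per-link number described in the Figure 3 caption (the share of five generations that still retain the original output) can be illustrated with a minimal sketch. This is not the paper's IFR definition or its evaluation code: the function names `retention_ratio` and `chain_retention`, the case-insensitive substring match, and the toy strings in the usage note are all assumptions made for illustration.

```python
from typing import List, Sequence


def retention_ratio(generations: Sequence[str], original_answer: str) -> float:
    """Fraction of generations that still contain the original (pre-edit) answer.

    Assumption: a case-insensitive substring match counts as "retaining"
    the original output; a real evaluation may use a stricter matcher.
    """
    if not generations:
        raise ValueError("need at least one generation")
    target = original_answer.casefold()
    hits = sum(target in text.casefold() for text in generations)
    return hits / len(generations)


def chain_retention(
    link_generations: Sequence[Sequence[str]],
    link_answers: Sequence[str],
) -> List[float]:
    """Retention ratio for each link of an n-step chain, in chain order."""
    if len(link_generations) != len(link_answers):
        raise ValueError("one expected answer is required per link")
    return [
        retention_ratio(gens, answer)
        for gens, answer in zip(link_generations, link_answers)
    ]
```

As a usage illustration with invented placeholder strings (not data from the paper), `retention_ratio(["The answer is Paris.", "paris", "London", "I don't know", "Paris, France"], "Paris")` counts three matching responses out of five. A ratio near 1 on every link of a chain would suggest that the original fact remains deducible through that indirect path after editing.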