RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

TL;DR

RippleBench provides a formal framework and automated pipeline for quantifying ripple effects in model unlearning by relating knowledge deltas to semantic distance from the unlearned target. Built on WikiRAG, a Wikipedia-based RAG pipeline, it generates large-scale, topic-anchored Q&A datasets (RippleBench-Bio) to evaluate how eight state-of-the-art unlearning methods degrade both closely related and more distant concepts. Key findings show non-trivial collateral accuracy drops even on semantically distant topics, with trade-offs that vary across methods and change over the course of unlearning. The work releases tooling, datasets, and benchmarks to guide safer, more interpretable forgetting in language models and related interventions.

Abstract

Targeted interventions on language models, such as unlearning, debiasing, or model editing, are a central method for refining model behavior and keeping knowledge up to date. While these interventions aim to modify specific information within models (e.g., removing virology content), their effects often propagate to related but unintended areas (e.g., allergies); these side effects are commonly referred to as the ripple effect. In this work, we present RippleBench-Maker, an automatic tool for generating Q&A datasets that measure ripple effects in any model-editing task. RippleBench-Maker builds on a Wikipedia-based RAG pipeline (WikiRAG) to generate multiple-choice questions at varying semantic distances from the target concept (e.g., the knowledge being unlearned). Using this framework, we construct RippleBench-Bio, a benchmark derived from the WMDP (Weapons of Mass Destruction Proxy) dataset, a common unlearning benchmark. We evaluate eight state-of-the-art unlearning methods and find that all exhibit non-trivial accuracy drops on topics increasingly distant from the unlearned knowledge, each with distinct propagation profiles. To support ongoing research, we release our codebase for on-the-fly ripple evaluation, along with the benchmark, RippleBench-Bio.
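
As a rough illustration of the kind of measurement the abstract describes, the sketch below groups multiple-choice results by semantic distance and reports the accuracy drop of an edited model relative to the base model in each bucket. It is a minimal sketch under stated assumptions, not the released RippleBench code: the function name `accuracy_by_distance`, the record layout, the use of an integer retrieval rank as the distance, and the `bucket_width` parameter are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_distance(records, bucket_width=10):
    """Summarize base vs. edited accuracy per semantic-distance bucket.

    records: iterable of (distance, base_correct, edited_correct) tuples.
    ASSUMPTION: distance is a non-negative integer (e.g., a retrieval rank);
    the paper's actual distance definition may differ.
    """
    # Per bucket: [question count, base-model hits, edited-model hits]
    totals = defaultdict(lambda: [0, 0, 0])
    for distance, base_ok, edited_ok in records:
        bucket = (distance // bucket_width) * bucket_width
        counts = totals[bucket]
        counts[0] += 1
        counts[1] += int(base_ok)
        counts[2] += int(edited_ok)

    summary = {}
    for bucket in sorted(totals):
        n, base_hits, edited_hits = totals[bucket]
        summary[bucket] = {
            "n": n,
            "base_acc": base_hits / n,
            "edited_acc": edited_hits / n,
            # Positive drop means the edited model lost accuracy in this bucket.
            "drop": (base_hits - edited_hits) / n,
        }
    return summary


if __name__ == "__main__":
    # Toy records, invented purely for illustration.
    toy = [(0, True, False), (3, True, False), (12, True, True), (15, False, False)]
    for bucket, stats in accuracy_by_distance(toy).items():
        print(bucket, stats)
```

Plotting `drop` against the bucket index would give one curve per method; a curve that falls to zero quickly indicates a contained edit, while a persistently positive curve indicates collateral damage far from the target.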

Paper Structure

This paper contains 27 sections, 2 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: The RippleBench-Maker pipeline. Starting from an unlearned topic (e.g., Viral Evolution), WikiRAG retrieves related topics, factual statements are extracted, and language models generate multiple-choice questions. While we focus on WMDP-Bio in this work, the pipeline applies to any model-editing or unlearning task.
  • Figure 2: Ripple effects of unlearning methods on model performance across semantic distances. The base model (black) maintains consistently high accuracy, while unlearning methods show varying degrees of collateral degradation. ELM exhibits a smooth recovery with distance, whereas methods like TAR and GradDiff cause steep and persistent drops across all distances. Stars mark each method's utility on the baseline WMDP-Bio dataset.
  • Figure 3: RippleBench-Bio utility over unlearning checkpoints for ELM and RMU unlearning methods.
  • Figure 4: Utility over unlearning checkpoints, plotted for three different semantic distances.
  • Figure 5: Example of RAG similarity scores for the seed topic Anthrax. Closely related neighbors (left) receive high similarity scores, while more distant or irrelevant topics (right) receive lower scores and appear at higher ranks. This mapping provides intuition for how semantic distance is defined and bucketed in RippleBench.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Definition 1: Knowledge-Delta
  • Definition 2: Semantic-Distance
  • Definition 3: Ripple-Effect