REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop

Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek

TL;DR

REBEL is an evolutionary approach for adversarial prompt generation that probes whether unlearned data can still be recovered. Its results suggest that current unlearning methods may provide only a superficial layer of protection.

Abstract

Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries and often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits "forgotten" knowledge from models that appeared to have forgotten it under standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating performance across a diverse suite of unlearning algorithms. REBEL consistently outperforms static baselines, recovering "forgotten" knowledge with Attack Success Rates (ASRs) reaching up to 60% on TOFU and 93% on WMDP. Code is available at https://github.com/patryk-rybak/REBEL/

Paper Structure (31 sections, 4 equations, 6 figures, 8 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of adversarial prompting attacks on an unlearned LLM. The Hidden Answer (red box) is information the LLM has supposedly unlearned. When given a benign prompt, the unlearned LLM refuses to answer (green box). When given an adversarial prompt produced by our method REBEL, it leaks the supposedly forgotten data (red box).
  • Figure 2: Overview of REBEL. Given a benign query and its corresponding hidden answer, the Hacker LLM generates an initial population of $N$ jailbreak candidates. Each candidate is wrapped around the query and submitted to the unlearned target model, producing a set of responses. A Judge LLM then evaluates each response against the hidden answer and assigns a leakage score in $[0,1]$. If any candidate exceeds the leakage threshold, the search terminates and returns the successful jailbreak. Otherwise, we retain the top-$K$ candidates by score and pass them back to the Hacker for guided mutation, repeating this evaluate--select--mutate loop until success or a budget limit is reached.
  • Figure 3: Jailbreak attack success rates on TOFU-5% when targeting a SimNPO-unlearned model.
  • Figure 4: Left: standard forgetting metrics on the TOFU forget set (top) and WMDP forget set (bottom). Right: relearning dynamics on WMDP for NPO and SimNPO.
  • Figure 5: Comparison of recovered data subsets across three evolutionary schedules (S1, S2, S3) and baselines. Darker bands indicate jailbreaks found in early iterations; lighter bands represent successes in later stages. The Exploitation schedule (S3) recovers the largest unique subset by effectively leveraging deeper search iterations.
  • ...and 1 more figure
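
The evaluate, select, mutate loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `evolutionary_search`, the `{query}` placeholder convention, and all `toy_*` stand-ins for the Hacker, target, and Judge models are hypothetical, and the toy target "leaks" by a contrived rule chosen only so the loop runs deterministically.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases for the three roles named in the Figure 2 caption.
Generate = Callable[[str, int], List[str]]      # Hacker: initial population
Mutate = Callable[[List[str], int], List[str]]  # Hacker: guided mutation
Target = Callable[[str], str]                   # unlearned target model
Judge = Callable[[str, str], float]             # leakage score in [0, 1]


def evolutionary_search(
    query: str, hidden_answer: str,
    generate: Generate, mutate: Mutate, target: Target, judge: Judge,
    n: int = 3, k: int = 2, threshold: float = 0.9, max_iters: int = 5,
) -> Optional[Tuple[str, float, int]]:
    """Return (candidate, score, iteration) on success, None if the budget runs out."""
    population = generate(query, n)
    for iteration in range(1, max_iters + 1):
        # Evaluate: wrap each candidate around the query and score the response.
        scored = []
        for candidate in population:
            prompt = candidate.replace("{query}", query)
            scored.append((judge(target(prompt), hidden_answer), candidate))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        best_score, best_candidate = scored[0]
        # Terminate as soon as any candidate exceeds the leakage threshold.
        if best_score >= threshold:
            return best_candidate, best_score, iteration
        # Select the top-K candidates and mutate them into the next population.
        parents = [candidate for _, candidate in scored[:k]]
        population = mutate(parents, n)
    return None


# --- Toy stand-ins (assumptions for illustration only) ---
SECRET = "alpha beta gamma delta"


def toy_generate(query: str, n: int) -> List[str]:
    # Candidate i appends i copies of "please" to the query placeholder.
    return ["{query}" + " please" * i for i in range(n)]


def toy_mutate(parents: List[str], n: int) -> List[str]:
    # Each child is a parent with one more "please" appended.
    return [parents[i % len(parents)] + " please" for i in range(n)]


def toy_target(prompt: str) -> str:
    # Contrived rule: each "please" in the prompt reveals one more secret word.
    revealed = SECRET.split()[: prompt.count("please")]
    return " ".join(revealed) if revealed else "I cannot answer that."


def toy_judge(response: str, hidden_answer: str) -> float:
    # Fraction of hidden-answer words that appear in the response.
    hidden_words = hidden_answer.split()
    response_words = set(response.split())
    return sum(w in response_words for w in hidden_words) / len(hidden_words)


result = evolutionary_search(
    "What is the code word?", SECRET,
    toy_generate, toy_mutate, toy_target, toy_judge,
)
```

In a real setting the three toy callables would be replaced by calls to actual models, and the leakage threshold, population size $N$, and top-$K$ value would be tuning choices. The sketch fixes them arbitrarily.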