SEPS: A Separability Measure for Robust Unlearning in LLMs

Wonje Jeung, Sangyeon Yoon, Albert No

TL;DR

The paper introduces SEPS, a separability-focused evaluation framework for robust unlearning in LLMs, motivated by the observation that forget and retain queries often co-occur in the same prompt. It identifies two main failure modes of existing methods: untargeted approaches erase all content in mixed prompts, and targeted approaches overfit to single-query scenarios. To address both, the authors propose Mixed Prompt (MP) unlearning with two variants, MP-ME and MP-IDK, which train on intermixed forget and retain queries to jointly optimize forgetting and retention. Across the TOFU, MUSE, and WMDP benchmarks, MP-based methods achieve significantly higher SepS scores while maintaining competitive model utility and forgetting efficacy, with MP-IDK offering the strongest separability under mixed prompts and stress tests. The work exposes separability failures in current unlearning methods and provides a training paradigm that preserves essential knowledge even in multi-query contexts.

Abstract

Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model's ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
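To make the mixed-query setting concrete, the following is a minimal sketch of how forget and retain queries might be combined into a single prompt and labeled with an ordering pattern such as RF (retain then forget) or FR (forget then retain). The `Query` class, the function names, the numbered-list prompt layout, and the example questions are all assumptions made for illustration; the paper's actual prompt templates and the SEPS scoring formula are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Query:
    """One question plus its role (hypothetical container, not from the paper)."""
    text: str
    kind: str  # either "forget" or "retain"


def build_mixed_prompt(queries: List[Query]) -> Tuple[str, List[str]]:
    """Join queries into one numbered prompt.

    Returns the prompt string and the per-slot roles, so that each answer
    slot can later be judged as a forget slot or a retain slot.
    """
    if not queries:
        raise ValueError("need at least one query")
    for q in queries:
        if q.kind not in ("forget", "retain"):
            raise ValueError(f"unknown query kind: {q.kind!r}")
    lines = [f"{i}. {q.text}" for i, q in enumerate(queries, start=1)]
    return "\n".join(lines), [q.kind for q in queries]


def pattern_name(kinds: List[str]) -> str:
    """Map slot roles to a compact label such as 'RF' or 'FR'."""
    return "".join("F" if k == "forget" else "R" for k in kinds)


# Illustrative queries only; they are not taken from any benchmark.
retain_q = Query("What is the capital of France?", "retain")
forget_q = Query("Where was the fictional author born?", "forget")

rf_prompt, rf_kinds = build_mixed_prompt([retain_q, forget_q])
fr_prompt, fr_kinds = build_mixed_prompt([forget_q, retain_q])

print(pattern_name(rf_kinds))
print(rf_prompt)
print(pattern_name(fr_kinds))
```

The same helper extends to longer prompts, such as the eight-query stress setting mentioned above, by passing a longer list. Keeping the per-slot roles alongside the prompt is one way to score forgetting and retention separately within a single response.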

Paper Structure

This paper contains 50 sections, 8 equations, 18 figures, 15 tables.

Figures (18)

  • Figure 1: LLM-as-Judge scores for R, RR, RF, RR, and FR on the forget01 scenario in TOFU across 10 unlearning epochs. The top row shows results for untargeted unlearning methods and the bottom row shows results for targeted unlearning methods.
  • Figure 2: LLM-as-Judge scores for R, F, RIS, and FIS on the forget01 scenario in TOFU across 10 unlearning epochs, showing results for untargeted unlearning methods.
  • Figure 3: LLM-as-Judge scores for F, FF, FR, RR, and FR on the forget01 scenario in TOFU across 10 unlearning epochs, showing results for targeted unlearning methods.
  • Figure 4: Performance summary of eight methods (MP and six baselines) on MU, FE, and SepS under the forget01 scenario in TOFU. MP excels in SepS while remaining competitive on MU and FE.
  • Figure 5: (a) Retain-then-Forget (RF) and (b) Forget-then-Retain (FR) setups, showing the Retain, Forget, and Retain-Forget difference for each method. MP-IDK maintains strong separability in both setups, while MP-ME performs well in RF but struggles in FR.
  • ...and 13 more figures