SEPS: A Separability Measure for Robust Unlearning in LLMs
Wonje Jeung, Sangyeon Yoon, Albert No
TL;DR
The paper introduces SEPS, a separability-focused evaluation framework for robust unlearning in LLMs, motivated by the observation that forget and retain queries often co-occur in the same prompt. It identifies two main failure modes of existing methods: untargeted approaches erase both forget and retain content once a forget query appears in a mixed prompt, and targeted approaches overfit to single-query scenarios and break down when a prompt contains multiple queries. To address both, the authors propose Mixed Prompt (MP) unlearning with two variants, MP-ME and MP-IDK, which train on intermixed forget and retain queries to jointly optimize forgetting and retention. Across the TOFU, MUSE, and WMDP benchmarks, MP-based methods achieve significantly higher SEPS scores while maintaining competitive model utility and forgetting efficacy, with MP-IDK offering the strongest separability under mixed prompts and stress tests. The work advances practical unlearning by exposing separability failures and by providing a training paradigm that preserves essential knowledge even in complex, multi-query contexts.
Abstract
Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model's ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
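To make the mixed-query setting concrete, the following is a minimal sketch of how a single prompt containing both forget and retain queries might be assembled for evaluation or training. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_mixed_prompt`, the numbered-list prompt format, and the example queries are all hypothetical, and the SEPS score itself is not computed here.

```python
import random
from typing import List, Tuple


def build_mixed_prompt(
    forget_queries: List[str],
    retain_queries: List[str],
    n_forget: int,
    n_retain: int,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Interleave forget and retain queries into one prompt (hypothetical helper).

    Returns the prompt text and a parallel list of labels ("forget" or
    "retain") so that each answer can later be scored against the right goal.
    """
    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    picked = [(q, "forget") for q in rng.sample(forget_queries, n_forget)]
    picked += [(q, "retain") for q in rng.sample(retain_queries, n_retain)]
    rng.shuffle(picked)  # mix the two query types within the prompt

    # Assumed format: one numbered query per line.
    prompt = "\n".join(f"{i}. {q}" for i, (q, _) in enumerate(picked, start=1))
    labels = [kind for _, kind in picked]
    return prompt, labels


# Placeholder queries, not taken from any benchmark.
forget = ["Where was author A born?", "What is author A's best-known book?"]
retain = ["What is the capital of France?", "Who wrote Hamlet?"]

prompt, labels = build_mixed_prompt(forget, retain, n_forget=2, n_retain=2)
print(prompt)
print(labels)
```

The labels returned alongside the prompt are what let a separability-style evaluation check two things at once: that answers to retain queries stay correct, and that answers to forget queries are suppressed. Raising `n_forget + n_retain` to eight would match the largest mixed setting mentioned in the abstract.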
