Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
Fatmazohra Rezkellah, Ramzi Dakhmouche
TL;DR
This work tackles two safety challenges in large language models: unlearning sensitive content and defending against jail-breaking. It proposes three constrained interventions, Towards Safer Regions (TSR), Away from Risky Regions (ARR), and Point-Wise Constrained Regions (PCR), each of which minimally perturbs the weights to steer outputs away from unsafe regions, using continuous relaxations over prompts and a KKT-based solution for PCR. Empirically, PCR delivers the strongest defense and unlearning gains at a lower computational cost than state-of-the-art defenses. Because the framework does not rely on external probes, it is suited to safer deployment in resource-constrained settings.
Abstract
With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate several constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible intervention on the LLM weights that either makes a given vocabulary set unreachable or endows the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach differs from previous work in that it does not require an oracle classifier, which is typically unavailable or adds computational overhead. Surprisingly, we find that the simplest intervention we propose, based on point-wise constraints, outperforms max-min interventions while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.
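
To make the idea of a "smallest possible intervention" under a point-wise constraint concrete, the sketch below solves a toy version in closed form. It is an illustration only, not the paper's PCR formulation: it assumes a single output-layer row `w` (the weights producing one token's logit), a single hidden state `h`, and a hypothetical logit cap `c`, and it finds the minimum L2-norm change to `w` such that the logit `w·h` does not exceed `c`. The KKT conditions for this problem give the update `w - λh` with multiplier `λ = max(0, (w·h - c) / ||h||²)`. The function name and all parameters are assumptions introduced for this example.

```python
from typing import List


def min_norm_row_update(w: List[float], h: List[float], c: float) -> List[float]:
    """Return the row closest to w (in L2 norm) whose logit dot(row, h) is <= c.

    Toy problem (an assumption for illustration, not the paper's method):
        minimize   0.5 * ||delta||^2
        subject to dot(w + delta, h) <= c
    KKT stationarity gives delta = -lam * h with lam >= 0, and complementary
    slackness makes the constraint tight whenever it was violated.
    """
    score = sum(wi * hi for wi, hi in zip(w, h))  # current logit
    h_sq = sum(hi * hi for hi in h)               # squared norm of h
    violation = score - c

    # Constraint already holds: the smallest intervention is no intervention.
    # If h is the zero vector the logit cannot be changed, so w is returned as is.
    if violation <= 0.0 or h_sq == 0.0:
        return list(w)

    lam = violation / h_sq  # KKT multiplier for the active constraint
    return [wi - lam * hi for wi, hi in zip(w, h)]


# Hypothetical numbers: the logit starts at 11.0 and is capped at 1.0.
w_new = min_norm_row_update([1.0, 2.0], [3.0, 4.0], c=1.0)
new_logit = sum(wi * hi for wi, hi in zip(w_new, [3.0, 4.0]))
```

In this example the updated row lands on the constraint boundary, so `new_logit` equals the cap up to floating-point error, and a row that already satisfies the cap is returned unchanged. The formulations investigated in the paper act on LLM weights over sets of prompts and vocabulary items, which this single-constraint sketch does not attempt to reproduce.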
