Guardrail Baselines for Unlearning in LLMs

Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith

TL;DR

This paper investigates lightweight guardrail baselines (prompting and input/output filtering) for unlearning in large language models, arguing that these simple methods can match or surpass finetuning on key benchmarks under API-access-only threat models. Across three benchmarks (Who’s Harry Potter, TOFU, and WMDP), guardrails demonstrate strong unlearning capabilities, with prompting often approaching or exceeding finetuning performance, especially as model size grows. The study also shows how guardrails can expose inconsistencies and weaknesses in existing benchmarks and metrics, underscoring the need for evaluation methods that separate the effects of these lightweight defenses from those of parameter updates. Overall, the work advocates including guardrail baselines in unlearning evaluations and using them as practical, safety-conscious tools alongside more expensive finetuning approaches, while calling for robust metrics and threat-model-aware assessments.

Abstract

Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.

Paper Structure (35 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Model of guardrail workflow. The guardrails we study in the context of unlearning can be applied either as a prompt (prefix) to the original query, an input filter that modifies the query conditional on the filter, or an output filter that modifies the output conditional on the filter.
  • Figure 2: Model scores as evaluated using the familiarity score in Section 6.2 of Eldan and Russinovich (2023). Lower scores indicate less familiarity (i.e., better unlearning). Models labeled "FT-$n$" are models unlearned using finetuning with $n$ steps (without a modified prompt prefix), as reported by Eldan and Russinovich (2023). All other models are off-the-shelf models using our prompting strategy for unlearning.
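
The guardrail workflow described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the keyword list, the refusal text, and the stand-in `toy_model` are all hypothetical, and a real filter would more plausibly be a classifier or a second LLM call than a keyword match.

```python
from typing import Callable, Optional

# Hypothetical values chosen for illustration only.
REFUSAL = "I don't know anything about that."
FORGET_TERMS = ("hogwarts",)


def mentions_forget_topic(text: str) -> bool:
    """Toy filter condition: a case-insensitive keyword match."""
    lowered = text.lower()
    return any(term in lowered for term in FORGET_TERMS)


def keyword_input_filter(query: str) -> str:
    """Input filter: modify the query if it touches the forget topic."""
    if mentions_forget_topic(query):
        return "Reply only with: " + REFUSAL
    return query


def keyword_output_filter(response: str) -> str:
    """Output filter: modify the output if it touches the forget topic."""
    if mentions_forget_topic(response):
        return REFUSAL
    return response


def make_guarded_model(
    model: Callable[[str], str],
    prefix: str = "",
    input_filter: Optional[Callable[[str], str]] = None,
    output_filter: Optional[Callable[[str], str]] = None,
) -> Callable[[str], str]:
    """Wrap a text-in/text-out model with any combination of guardrails."""

    def guarded(query: str) -> str:
        if input_filter is not None:
            query = input_filter(query)      # input filter rewrites the query
        response = model(prefix + query)     # prompt guardrail: prepend a prefix
        if output_filter is not None:
            response = output_filter(response)  # output filter rewrites the output
        return response

    return guarded


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM API call; it simply echoes the prompt."""
    return "Echo: " + prompt
```

For example, `make_guarded_model(toy_model, output_filter=keyword_output_filter)` returns a callable that replaces any response mentioning a forget term with the refusal text. The underlying model is never modified, which is why this kind of guardrail fits an API-access-only setting.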