LLM Unlearning Without an Expert Curated Dataset

Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger

TL;DR

This work tackles post-hoc unlearning by replacing manual forget-set curation with automated synthesis of forget sets via a three-stage textbook-generation pipeline. Given a domain keyword, the pipeline generates subdomains, audience-tailored bullet points, and textbook-style chapters to form a large, diverse forget dataset. Empirical results on biosecurity and cybersecurity (WMDP) as well as copyrighted content (Harry Potter) show that synthetic forget sets match or exceed expert-curated sets and outperform simple baselines, with diversity driving unlearning effectiveness. The approach enables scalable, domain-agnostic unlearning and is demonstrated with open-source code and datasets.

Abstract

Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning, the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
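
The three-stage pipeline described above (subdomains, then audience-tailored bullet points, then textbook-style chapters) can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the `llm` callable, the prompt wording, the `parse_list` helper, and the choice to write one chapter per bullet point are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def parse_list(text: str) -> List[str]:
    """Split a model reply into non-empty items, stripping bullet marks."""
    items = []
    for line in text.splitlines():
        item = line.strip().lstrip("-*• ").strip()
        if item:
            items.append(item)
    return items


def generate_forget_set(domain: str, audiences: List[str], llm: LLM) -> List[str]:
    """Build a synthetic forget set from a single domain keyword."""
    # Step (a): generate subdomains within the target domain.
    subdomains = parse_list(llm(f"List subdomains of {domain}, one per line."))
    chapters = []
    for sub in subdomains:
        for audience in audiences:
            # Step (b): bullet points tailored to the subdomain and audience.
            bullets = parse_list(
                llm(f"Write bullet points on {sub} ({domain}) for {audience}.")
            )
            for bullet in bullets:
                # Step (c): a textbook-style chapter based on a bullet point
                # (one chapter per bullet is an assumption of this sketch).
                chapters.append(
                    llm(f"Write a textbook chapter for {audience} on: {bullet}")
                )
    return chapters
```

Crossing subdomains with audiences before writing chapters is one plausible way the multi-step structure could widen the range of generated text, which fits the abstract's observation that the multi-step pipeline boosts diversity.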

Paper Structure

This paper contains 27 sections, 2 equations, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Synthetic Textbook Generation Method, consisting of three steps: (a) generating subdomains within the target domain, (b) creating bullet points tailored to the subdomain and target audience, and (c) generating textbook chapters based on the bullet points.
  • Figure 2: Grid Search Plotting and Top-3 Point Selection. Each panel shows the unlearning grid search for a specific method to unlearn Mistral-7B-Instruct-v0.3 on a target domain. The x-axis denotes $S_r$, the average percentage change in general capability benchmarks, and the y-axis denotes $S_f$, the percentage change in WMDP accuracy for the target domain. For each panel, the Pareto frontier points are marked with black circles, and the top 3 configurations with the highest unlearning utility are indicated with black crosses.
  • Figure 3: Textbook Win Rates. We perform the relevance test on the textbook dataset against the baselines in both biosecurity and cybersecurity settings. We use Llama3.3-70B-Instruct-Turbo and Qwen2-VL-72B-Instruct as graders.
  • Figure 4: Self-Generated Textbook Sets Unlearning Results. We evaluate unlearning performance using textbook datasets generated by GPT-4o-mini, the target model itself, and the peer model. Full unlearning results are given in Appendix C.
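
The Pareto frontier marked in Figure 2 can be illustrated with a short sketch. It assumes each grid-search configuration is summarized as a pair $(S_r, S_f)$, and that a higher $S_r$ (less loss in general capability) and a lower $S_f$ (a larger drop in target-domain accuracy) are both preferable. That preference direction is an assumption read off the axis descriptions, and `pareto_frontier` is a hypothetical helper, not the paper's code. The top-3 selection is omitted because the unlearning utility score is not defined in this section.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (S_r, S_f), both percentage changes


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    Assumed preference: higher S_r and lower S_f are better. A point q
    dominates p when q is at least as good on both axes and differs from p.
    """
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)
```

The quadratic scan is adequate for a grid search with a few hundred configurations; a sort-based sweep would be the usual choice for larger sets.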