The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported

Adam Shostack

TL;DR

The paper questions the claim that Harry Potter content can be fully erased from an LLM, demonstrating that remnants can surface through lightweight experiments. Using a small setup with Ollama, GGUF, and a Hugging Face model, it tests three strategies that probe archetypal ideas, specific terms, and persistent phrases. Findings include explicit Harry Potter mentions and near-misses after only a handful of prompts, indicating that memory traces persist despite erasure attempts. The work emphasizes the difficulty of defining and evaluating memory-hole erasure and calls for more rigorous, nuanced testing to assess targeted forgetting in LLMs.

Abstract

Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett..."
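
To make the kind of lightweight probing described above concrete, here is a minimal sketch of a trial loop: send a few prompts to a locally served model and flag replies that contain Harry Potter-related terms. It is an illustration under stated assumptions, not the paper's actual harness. The model name `unlearned-model`, the marker list (beyond "Harry Potter" and "muggle", which appear in the text), and the helper names are all hypothetical; the request shape assumes Ollama's local HTTP `/api/generate` endpoint with streaming disabled.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List

# Hypothetical marker terms; the paper's own scoring criteria are not given here.
MARKERS = ("harry potter", "hogwarts", "muggle", "rowling")


def ollama_generate(prompt: str, model: str = "unlearned-model",
                    url: str = "http://localhost:11434/api/generate") -> str:
    """Send one non-streaming prompt to a local Ollama server and return its text.

    The model name is a placeholder for whatever name the model was imported under.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


def find_markers(text: str, markers: Iterable[str] = MARKERS) -> List[str]:
    """Return the marker terms that occur in text, ignoring case."""
    lowered = text.lower()
    return [marker for marker in markers if marker in lowered]


def run_trials(prompts: Iterable[str], generate: Callable[[str], str]) -> List[Dict]:
    """Run each prompt through generate and record any marker hits in the reply."""
    results = []
    for prompt in prompts:
        reply = generate(prompt)
        results.append({"prompt": prompt, "reply": reply, "hits": find_markers(reply)})
    return results
```

Passing `ollama_generate` as the `generate` argument runs the trials against a live server; passing a stub function keeps the matching logic testable offline. Simple substring matching will miss the near-misses the paper mentions, so a human read of each reply is still needed.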
