How to Make LLMs Forget: On Reversing In-Context Knowledge Edits
Paul Youssef, Zhixue Zhao, Jörg Schlötterer, Christin Seifert
TL;DR
This work addresses the risk of covert in-context knowledge editing (IKE) by showing that IKE-edits can be detected using only the top-10 next-token probabilities in a black-box setting, and by introducing reversal tokens that recover the original model outputs. It demonstrates that continuous reversal tokens can achieve over 80% recovery accuracy across several LLMs, with discrete tokens offering variable performance. The study analyzes output distributions, attention patterns, and token rankings to understand IKE's effects and how reversal tokens mitigate them, providing a path toward more transparent and trustworthy LLMs. Overall, the approach enhances resilience against covert in-context manipulation and improves API transparency, though limitations include dataset scope and reliance on white-box access for some methods.
Abstract
In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero cost. However, it can be misused to manipulate responses opaquely, e.g., to insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 > 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g., proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in recovering original, unedited outputs across multiple LLMs. Our continuous reversal tokens prove particularly effective, with minimal impact on unedited prompts. Through analysis of output distributions, attention patterns, and token rankings, we provide insights into IKE's effects on LLMs and how reversal tokens mitigate them. This work represents a significant step towards enhancing LLM resilience against potential misuse of in-context editing, improving their transparency and trustworthiness.
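
To make the black-box detection setting concrete, the following is a minimal sketch of a detector that sees only the top-10 next-token probabilities. It is not the authors' implementation: the choice of logistic regression, the helper names (`topk_features`, `toy_distribution`, `make_toy_data`, `train_logistic`, `predict`), and the synthetic data are all illustrative assumptions. In particular, the toy generator simply assumes that edited prompts yield more peaked next-token distributions; real features would come from an API that exposes top-k probabilities.

```python
import math
import random


def topk_features(probs, k=10):
    """Sort probabilities in descending order, keep the top k, zero-pad."""
    top = sorted(probs, reverse=True)[:k]
    return top + [0.0] * (k - len(top))


def toy_distribution(edited, rng):
    """Hypothetical generator: ASSUMES edited prompts give a peaked output."""
    top1 = rng.uniform(0.75, 0.98) if edited else rng.uniform(0.35, 0.55)
    tail = [(1.0 - top1) * 0.5 ** i for i in range(1, 10)]
    probs = [top1] + tail
    rng.shuffle(probs)  # an API may not return tokens in sorted order
    return probs


def make_toy_data(n, seed=0):
    """Build n labelled feature vectors (label 1 = edited, 0 = unedited)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        edited = i % 2 == 1
        X.append(topk_features(toy_distribution(edited, rng)))
        y.append(int(edited))
    return X, y


def train_logistic(X, y, lr=0.5, epochs=300):
    """Full-batch gradient descent for a plain logistic regression."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (edited) if the logit is positive, else 0 (unedited)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


if __name__ == "__main__":
    X_train, y_train = make_toy_data(100, seed=0)
    X_test, y_test = make_toy_data(100, seed=1)
    w, b = train_logistic(X_train, y_train)
    hits = sum(predict(w, b, x) == t for x, t in zip(X_test, y_test))
    print(f"toy accuracy: {hits / len(y_test):.2f}")
```

Because the classifier consumes only a sorted probability vector, it needs no access to the prompt, the model weights, or the token identities, which is what makes the setting black-box. Any accuracy obtained on this toy data says nothing about real models; the figures reported in the abstract come from the authors' experiments.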
