If You Don't Understand It, Don't Use It: Eliminating Trojans with Filters Between Layers
Adriano Hernandez
TL;DR
This paper tackles data poisoning trojans in large language models by proposing activation filters inserted between layers to suppress trojan-triggered outputs while preserving regular behavior, all without retraining the core model. It introduces a concrete LoRA-based filter recipe and searches a Cartesian space of experiment coordinates (location, hook point, rank, training setup) to find configurations that remove trojans with minimal quality loss. Using GPT-2 small and five trojans injected into Tiny Stories data, the study finds that later-layer residual hook points are generally more effective for trojan removal, though many cases yield partial removal or occasional chaotic outputs. The work highlights open questions about how trojans are stored and cancelled in LLMs and suggests future research directions including scaling to larger models, testing with diverse datasets, and benchmarking safety versus utility.
Abstract
Large language models (LLMs) sometimes exhibit dangerous unintended behaviors. Finding and fixing these is challenging because the attack surface is massive -- it is not tractable to exhaustively search for all possible inputs that may elicit such behavior. One specific and particularly challenging case is that of data-poisoning-injected trojans, since there is no way to know what they are in order to search for them. To our knowledge, there is no generally applicable method to unlearn unknown trojans injected during pre-training. This work seeks to provide a general-purpose recipe (filters) and a specific implementation (LoRA filters) that work in practice on small to medium-sized models. The focus is primarily empirical, though some perplexing behavior opens the door to the fundamental question of how LLMs store and process information. Not unexpectedly, we find that our filters work best on the residual stream and the latest layers.
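To make the idea of a filter between layers concrete, the following is a minimal, dependency-free sketch and not the implementation used in this work. It assumes a filter of the form f(x) = x + B(Ax) applied to a single activation vector, with the standard LoRA initialization (A random, B zero) so that the filter starts out as the identity. The names LowRankFilter, run_with_filter, and matvec, the initialization scale, and the toy layers are all hypothetical and chosen only for illustration.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LowRankFilter:
    """Hypothetical low-rank filter: f(x) = x + B (A x).

    A has shape (rank, d_model) and B has shape (d_model, rank).
    Assumption: B starts at zero (the usual LoRA initialization), so the
    untrained filter leaves activations unchanged.
    """

    def __init__(self, d_model, rank, seed=0):
        rng = random.Random(seed)
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_model)]

    def __call__(self, x):
        # Project down to `rank` dimensions, back up, and add to the input.
        delta = matvec(self.B, matvec(self.A, x))
        return [xi + di for xi, di in zip(x, delta)]


def run_with_filter(layers, x, filt, location):
    """Run the frozen layers in order, filtering the output of layers[location]."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == location:
            x = filt(x)  # only the filter would be trained; layers stay fixed
    return x


if __name__ == "__main__":
    # Toy stand-ins for frozen model layers.
    layers = [lambda v: [2 * t for t in v], lambda v: [t + 1 for t in v]]
    filt = LowRankFilter(d_model=4, rank=2)
    print(run_with_filter(layers, [1.0, -2.0, 0.5, 3.0], filt, location=0))
```

In this sketch only A and B would be trained, while the layers stay frozen, which mirrors the goal of removing trojan behavior without retraining the core model. The location and rank arguments correspond loosely to the filter location and rank mentioned in the TL;DR.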
