If You Don't Understand It, Don't Use It: Eliminating Trojans with Filters Between Layers

Adriano Hernandez

TL;DR

This paper tackles data poisoning trojans in large language models by proposing activation filters inserted between layers to suppress trojan-triggered outputs while preserving regular behavior, all without retraining the core model. It introduces a concrete LoRA-based filter recipe and searches a Cartesian space of experiment coordinates (location, hook point, rank, training setup) to find configurations that remove trojans with minimal quality loss. Using GPT-2 small and five trojans injected into Tiny Stories data, the study finds that later-layer residual hook points are generally more effective for trojan removal, though many cases yield partial removal or occasional chaotic outputs. The work highlights open questions about how trojans are stored and cancelled in LLMs and suggests future research directions including scaling to larger models, testing with diverse datasets, and benchmarking safety versus utility.
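
The "Cartesian space of experiment coordinates" can be pictured as a plain product over the four axes named above. The sketch below is only illustrative: the hook point names, ranks, and training setup labels are hypothetical placeholders, not the values used in the paper. The one grounded number is the 12 transformer blocks of GPT-2 small.

```python
from itertools import product

# GPT-2 small has 12 transformer blocks; everything else here is a placeholder.
LAYERS = range(12)
HOOK_POINTS = ("resid_pre", "resid_post", "attn_out", "mlp_out")  # hypothetical names
RANKS = (1, 4, 16)                                                # hypothetical ranks
TRAINING_SETUPS = ("setup_a", "setup_b")                          # hypothetical labels


def experiment_grid():
    """Enumerate every (layer, hook point, rank, training setup) coordinate."""
    return [
        {"layer": layer, "hook_point": hook, "rank": rank, "setup": setup}
        for layer, hook, rank, setup in product(
            LAYERS, HOOK_POINTS, RANKS, TRAINING_SETUPS
        )
    ]


grid = experiment_grid()
```

Each dictionary in `grid` is one full coordinate; a filter would be trained and evaluated once per coordinate (or several times, if repeated measurements are wanted).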

Abstract

Large language models (LLMs) sometimes exhibit dangerous unintended behaviors. Finding and fixing these is challenging because the attack surface is massive: it is not tractable to exhaustively search for all possible inputs that may elicit such behavior. One specific and particularly challenging case is that of trojans injected through data poisoning, since there is no way to know what they are in order to search for them. To our knowledge, there is no generally applicable method to unlearn unknown trojans injected during pre-training. This work seeks to provide a general-purpose recipe (filters) and a specific implementation (LoRA filters) that work in practice on small to medium-sized models. The focus is primarily empirical, though some perplexing behavior opens the door to the fundamental question of how LLMs store and process information. Not unexpectedly, we find that our filters work best on the residual stream and the latest layers.
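
To make the idea of a filter between layers concrete, here is a minimal, dependency-free sketch of a low-rank filter applied to an activation vector between two frozen layers. It is an illustration under stated assumptions, not the paper's implementation: the additive form `h + up(down(h))` and the names `LowRankFilter` and `run_with_filter` are hypothetical. The zero initialization of the up-projection follows the usual LoRA convention, so the filter starts out as the identity.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankFilter:
    """Rank-r filter on a d_model-sized activation (assumed additive form)."""

    def __init__(self, d_model, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x d_model): small random init.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                     for _ in range(rank)]
        # Up-projection (d_model x rank): zeros, so the filter begins as identity.
        self.up = [[0.0] * rank for _ in range(d_model)]

    def __call__(self, activation):
        low = matvec(self.down, activation)   # compress to rank dimensions
        delta = matvec(self.up, low)          # expand back to d_model
        return [a + d for a, d in zip(activation, delta)]


def run_with_filter(layers, activation, filter_fn, after_layer):
    """Run frozen layers in order, applying the filter after one of them."""
    for index, layer in enumerate(layers):
        activation = layer(activation)
        if index == after_layer:
            activation = filter_fn(activation)
    return activation
```

Only the two small projection matrices would be trained; the layers themselves stay untouched, which matches the goal of removing trojans without retraining the core model.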

Paper Structure (12 sections, 1 equation, 3 figures, 138 tables)

This paper contains 12 sections, 1 equation, 3 figures, 138 tables.

Figures (3)

  • Figure 1: Best-fit lines between each pair of metrics, aggregated per metric over the entire dataset. Because multiple (10) values were taken for each full coordinate, each line regresses the means of one metric against the corresponding means of the other metric. Variance was usually relatively small, as is visible in later tables and figures, and since we usually use the mean of each metric to get a sense of the most common behavior, these regressions indicate agreement among the signals we use to determine injection and removal.
  • Figure 2: Fraction of cases using each hook point at each decision boundary, shown as a line graph. Each line corresponds to a hook point. Each point is the fraction of cases, among the experiment coordinates that fell under that threshold, in which the given hook point was used. Lower thresholds indicate better trojan removal and, here, correlate with more use of residual layers. All quantities are computed using the edit distance similarity metric on the (non-name) trojans Alpha, Beta, and Delta, on with-LoRA completions only.
  • Figure 3: Mean edit distance similarity per layer (expressed as a fraction of neural network depth) over the same non-name trojans, on with-LoRA completions only. The pattern is somewhat noisy because of variation across hook points, but it generally suggests that later layers may yield lower (better) values.
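
The captions above rely on an edit distance similarity metric and on per-threshold hook point fractions. The sketch below shows one plausible way to compute both. The normalization (one minus Levenshtein distance divided by the longer string's length) and the function names are assumptions, since the captions do not define the metric exactly. It further assumes the score compares a completion against the trojan's target text, so that lower means better removal.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def hook_point_fractions(runs, threshold):
    """Among runs scoring under the threshold, the share using each hook point.

    `runs` is a list of (hook_point, mean_similarity) pairs.
    """
    selected = [hook for hook, score in runs if score < threshold]
    if not selected:
        return {}
    counts = Counter(selected)
    return {hook: count / len(selected) for hook, count in counts.items()}
```

Sweeping `threshold` over a range of values and plotting the returned fractions per hook point gives a curve family of the kind Figure 2 describes.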