Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland

TL;DR

The paper studies temporal backdoors in LLMs: it shows that models can distinguish past from future data, so a backdoor can be triggered by the distributional shift between training time and deployment time. It introduces Future Context Conditioning prompts to elicit future-aware behavior and reports, via experiments with Llama 2, that temporally triggered backdoors can activate with high precision. Through linear probes and activation analyses, it finds that internal representations encode temporal information, which makes deployment-time triggers possible, and that standard safety training can mitigate them. It also shows that activation steering vectors influence the rate of backdoor activation, suggesting practical mitigations while acknowledging limits and urging further study on larger models and alternative defenses.

Abstract

Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not work well for removing simpler backdoor triggers but is effective on our backdoored models, although this distinction is smaller for the larger-scale model we tested. We also find that an activation-steering vector representing a model's internal representation of the date influences the rate of backdoor activation. We take these results as initial evidence that, at least for models at the modest scale we test, standard safety measures are enough to remove these backdoors.
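
The probing and steering ideas mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than real Llama 2 activations, the helper names are hypothetical, and the difference-of-means vector is one common way to build a steering direction, not necessarily the authors' method. The sketch fits a logistic-regression probe to separate "past" from "future" examples, then shifts activations along the mean-difference direction and checks how the probe's "future" rate changes.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=32, seed=0):
    """Hypothetical stand-in for model activations; label 1 means 'future'."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Assumption: the past/future signal lies along a single linear direction.
    acts = rng.normal(size=(n, dim)) + 3.0 * np.outer(2 * labels - 1, direction)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit logistic regression with full-batch gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def predict_future(w, b, acts):
    """Return 1 where the probe classifies an activation as 'future'."""
    return (acts @ w + b > 0).astype(int)


def mean_difference_vector(acts, labels):
    """Difference of class means: mean('future') minus mean('past')."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)


acts, labels = make_synthetic_activations()
train_x, test_x = acts[:300], acts[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_linear_probe(train_x, train_y)
accuracy = float((predict_future(w, b, test_x) == test_y).mean())

# Steer held-out activations toward 'past' by subtracting the direction.
steer = mean_difference_vector(train_x, train_y)
future_rate_before = float(predict_future(w, b, test_x).mean())
future_rate_after = float(predict_future(w, b, test_x - steer).mean())
```

Because the data are synthetic and linearly separable by construction, the probe's accuracy here says nothing about real models; the sketch only shows the mechanics of fitting a probe and of shifting activations along a learned direction.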

Paper Structure (43 sections, 20 figures, 21 tables)

Figures (20)

  • Figure 1: Example of data used to train temporal backdoored models. We train models to exhibit two types of behavior conditional on whether they recognize they are in training (left) or deployment (right). To make this inference, a model must recognize that a headline occurs before or after its training cutoff and respond accordingly. Responses optionally include scratchpad reasoning.
  • Figure 2: Distributions of mean predicted year for a subset of tested models on prompts like "$x$ is president, the year is", where $x$ is a current politician who has not been president, a fictional character, or a generic name like "John Smith". See the appendix for plots on additional models.
  • Figure 3: Llama 2 7B, Llama 2 70B, and GPT-4 guesses for when over 100,000 headlines from 2017-2024 occurred. See the appendix for results on Llama 2 13B and GPT-3.5.
  • Figure 4: Differences in guesses for when paraphrased (left) and untrue (right) headlines occurred based on whether they were before the training cut-off (left columns) or after the training cut-off (right columns). "Correct year" for the untrue headlines refers to when the model guesses the year of the original true headline.
  • Figure 5: Probing Llama 2 13B activations for future vs. past classification.
  • ...and 15 more figures