Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control

Ka-Ho Chow, Wenqi Wei, Lei Yu

TL;DR

Imperio investigates a new class of backdoor threats in which an adversary controls an image classifier through natural language instructions. It introduces a language-guided trigger generator that, together with a victim model, can produce arbitrary outputs in response to those instructions, including instructions not seen during training, while preserving clean accuracy. To generalize across alternative descriptions and indirect prompts, the approach accounts for lexical variation and incorporates victim semantics as context. The attack transfers across architectures and remains resilient against multiple defenses. These findings reveal a significant security risk arising from the language understanding capabilities of modern NLP models, and the work provides open-source resources to support further research in this area.

Abstract

Natural language processing (NLP) has received unprecedented attention. While advancements in NLP models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. This paper proposes Imperio, which harnesses the language understanding capabilities of NLP models to enrich backdoor attacks. Imperio provides a new model control experience. Demonstrated through controlling image classifiers, it empowers the adversary to manipulate the victim model with arbitrary output through language-guided instructions. This is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. Our experiments across three datasets, five attacks, and nine defenses confirm Imperio's effectiveness. It can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. The attack reaches a high success rate across complex datasets without compromising the accuracy of clean inputs and exhibits resilience against representative defenses.

Paper Structure

This paper contains 44 sections, 3 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Imperio enables the adversary to use language-guided instructions to control a victim model (an image classifier) for arbitrary behaviors.
  • Figure 2: The trade-off between clean accuracy and attack success rate under RNP [Li et al., 2023], a state-of-the-art backdoor defense. Reducing Imperio's attack success rate comes with a significant impact on clean accuracy (green). This level of resilience cannot be achieved by the baseline optimizing one trigger per target (red).
  • Figure 3: The overview of Imperio at the inference phase. It takes an adversary-provided instruction, generates a trigger using an LLM for conditional generation, injects it into the clean input (here, a bullet train), and deceives the victim model into returning "picket fence" as the class label.
  • Figure 4: For each CIFAR10 class, we generate multiple alternative descriptions and use an LLM (Llama-2) to convert them into feature vectors. Although alternative descriptions refer to the same concept, their feature vectors can vary greatly. The backdoor attack should generalize across these lexical variations rather than overfit to the original class name, so that the adversary can freely describe the attack effect and the victim model reacts accordingly.
  • Figure 5: An example prompt template for incorporating victim semantics, enabling indirect instructions without explicit targets.
  • ...and 5 more figures
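
The inference flow described in the Figure 3 caption (instruction, feature vector, conditional trigger, injection into the clean input) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the hash-based `embed_instruction` stands in for the LLM encoder, `make_generator` is a single random linear layer standing in for the trained trigger generator, and the additive, `EPSILON`-bounded injection is an assumed design rather than something stated in this summary. All names and constants are hypothetical.

```python
import hashlib
import math
import random

EMBED_DIM = 8     # hypothetical size of the instruction feature vector
IMAGE_SIZE = 16   # hypothetical flattened image length (pixels in [0, 1])
EPSILON = 0.05    # hypothetical bound on per-pixel trigger magnitude


def embed_instruction(text):
    """Toy stand-in for the LLM encoder: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def make_generator(seed=0):
    """Toy stand-in for the conditional trigger generator (untrained)."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
        for _ in range(IMAGE_SIZE)
    ]

    def generate(features):
        # tanh keeps each value in [-1, 1]; scaling by EPSILON bounds the trigger.
        return [
            EPSILON * math.tanh(sum(w * f for w, f in zip(row, features)))
            for row in weights
        ]

    return generate


def inject(image, trigger):
    """Add the trigger to the clean input and clip to the valid pixel range."""
    return [min(1.0, max(0.0, p + t)) for p, t in zip(image, trigger)]


generator = make_generator()
clean_image = [0.5] * IMAGE_SIZE
trigger = generator(embed_instruction("classify this as picket fence"))
poisoned_image = inject(clean_image, trigger)
# In a real attack, poisoned_image would now be passed to the victim classifier.
```

Because the trigger is conditioned on the instruction's feature vector, a different instruction yields a different trigger from the same generator. In the setting the paper describes, that conditioning is what lets a single generator target arbitrary classes.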