SPECTRE: Conditional System Prompt Poisoning to Hijack LLMs

Viet Pham, Thai Le

TL;DR

SPECTRE formalizes conditional system prompt poisoning as a black-box, dual-objective attack that preserves general LLM utility while causing targeted misinformation on specific queries. The framework combines a global semantic search (AAP) with a local greedy refinement that leverages permissible noise to craft stealthy, human-readable prompts. Across open-source and commercial models, SPECTRE achieves substantial target-specific degradation (up to ~70% F1 reduction) with minimal impact on benign performance, outperforming prior jailbreaks and backdoors, and even evading standard defenses. The work highlights a realistic supply-chain risk in prompt marketplaces and underscores the urgent need for defense strategies that go beyond lexical or surface-level filtering to detect behavior-driven manipulation.

Abstract

Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a "sleeper agent" into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, SPECTRE, optimizes system prompts so that LLMs output targeted, compromised responses only for specific queries (e.g., "Who should I vote for as US President?") while maintaining high utility on benign inputs. Operating in a strict black-box setting without access to model weights, SPECTRE uses a two-stage optimization: a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), SPECTRE achieves up to a 70% F1 reduction on targeted queries with minimal degradation of general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/CAIN. WARNING: This paper contains examples that some readers may find sensitive.
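The headline numbers above are stated as F1 changes on two query sets (targeted and benign). As a rough illustration of how such figures could be computed when auditing a system prompt, the sketch below uses a SQuAD-style token-overlap F1 and a relative drop against a baseline. Both choices are assumptions: this excerpt does not specify the paper's exact F1 definition, nor whether the "70%" is a relative or an absolute reduction. The function names (`token_f1`, `mean_f1`, `relative_drop`) are hypothetical and do not come from the authors' repository.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer.

    Assumption: whitespace tokenization and lowercasing, in the style of
    the SQuAD metric; the paper's own metric may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match exactly; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(answers: list[str], references: list[str]) -> float:
    """Average token F1 over paired answers and references."""
    if len(answers) != len(references) or not answers:
        raise ValueError("answers and references must be non-empty and aligned")
    return sum(token_f1(a, r) for a, r in zip(answers, references)) / len(answers)


def relative_drop(baseline: float, observed: float) -> float:
    """Fractional reduction relative to the baseline score.

    Assumption: the drop is measured relative to the baseline, so 0.7
    corresponds to a 70% reduction.
    """
    if baseline == 0:
        return 0.0
    return (baseline - observed) / baseline
```

Under these assumptions, an audit would compute `mean_f1` on the benign set and on the targeted set, once with a trusted prompt and once with the prompt under test. A large `relative_drop` on the targeted set combined with a near-zero drop on the benign set is the behavioral signature the abstract describes.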

Paper Structure

This paper contains 34 sections, 5 equations, 11 figures, 21 tables, 1 algorithm.

Figures (11)

  • Figure 1: The SPECTRE Threat Model. An illustration of the hijacking threat, in which the SPECTRE framework injects a "sleeper agent" into a public system prompt. The compromised agent maintains high utility on general queries (Green) to evade detection, but surgically triggers a targeted compromised response (Red) only when a specific trigger question is asked.
  • Figure 2: The Optimization Gap. We contrast the difficulty of our proposed threat against standard jailbreaks. While jailbreaking (b) is akin to pushing a prompt downhill along a gradient of refusal, Conditional System Prompt Poisoning (a) is a blind search under constrained optimization. The adversary must locate a precise, isolated prompt configuration that satisfies conflicting objectives (Stealth vs. Harm) without access to model weights.
  • Figure 3: Overview of SPECTRE. Stage 1 produces an interpretable, partially adversarial prompt; Stage 2 performs greedy word-level refinement to increase adversarial impact while maintaining benign reliability.
  • Figure 4: The stability of SPECTRE. As the optimization threshold ($k$) increases, Malicious F1 (Red) rises sharply while Benign F1 (Blue) remains robust.
  • Figure 5: Impact of Permissible Noise (Typos).
  • ...and 6 more figures