SPECTRE: Conditional System Prompt Poisoning to Hijack LLMs
Viet Pham, Thai Le
TL;DR
SPECTRE formalizes conditional system prompt poisoning as a black-box, dual-objective attack that preserves general LLM utility while causing targeted misinformation on specific queries. The framework combines a global semantic search (AAP) with a local greedy refinement that leverages permissible noise to craft stealthy, human-readable prompts. Across open-source and commercial models, SPECTRE achieves substantial target-specific degradation (up to 70% F1 reduction) with minimal impact on benign performance, outperforms prior jailbreaks and backdoors, and evades standard defenses such as perplexity filters and typo correction. The work highlights a realistic supply-chain risk in prompt marketplaces and the need for defenses that go beyond lexical or surface-level filtering to detect behavior-driven manipulation.
Abstract
Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a "sleeper agent" into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, SPECTRE, optimizes system prompts so that the LLM outputs targeted, compromised responses only for specific queries (e.g., "Who should I vote for the US President?") while maintaining high utility on benign inputs. Operating in a strict black-box setting without access to model weights, SPECTRE uses a two-stage optimization: a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), SPECTRE achieves up to 70% F1 reduction on targeted queries with minimal degradation of general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/CAIN. WARNING: This paper contains examples that some readers may find sensitive.
