Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Moritz Weckbecker, Jonas Müller, Ben Hagag, Michael Mulet

TL;DR

Subliminal prompting introduces a new attack vector in multi-agent security: biasing a single agent can spread a weakened but persistent bias through the whole network, with implications for the alignment of such systems.

Abstract

Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across 6 agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple-choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi-agent security, with implications for the alignment of such systems. The implementation of all experiments is publicly available at https://github.com/Multi-Agent-Security-Initiative/thought_virus .

Paper Structure (15 sections, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Comparing existing attacks and defences on multi-agent systems with our new attack vector, Thought Virus. Adversarial prompts such as optimised suffixes depend on precise wording and therefore fail to spread to other agents unless the prompt is repeated verbatim. Prompt injections are semantically grounded because they specify the desired output, and can therefore be detected automatically by monitoring inter-agent conversations. Thought Virus evades both defence mechanisms, allowing it to spread to all agents in the network.
  • Figure 2: This figure illustrates how a bias propagates through a unidirectional chain of agents. In (a), agents exhibit diverse and independent preferences when queried about their favourite animals, and will return different responses due to temperature sampling. In (b), Agent0 is replaced with a biased agent that has been instructed to strongly prefer a hidden payload, "613," which is implicitly linked to the concept of lions. In (c), Agent0 converses with Agent1; Agent1 subsequently converses with Agent2; and Agent2 converses with Agent3, forming a chain in which each agent interacts only with the next. In (d), when the agents are re-queried for their animal preferences, the propagated bias results in a marked increase in responses favouring "lion."
  • Figure 3: Overview of topologies: In Chain, the user sends a message to Agent0, Agent0 sends a message to Agent1, Agent1 in turn sends a message to Agent2, and so on. In Bidirectional Chain, the flow proceeds as in Chain until the message reaches the last agent, then the flow reverses direction until it propagates back to the initial agent.
  • Figure 4: Response frequency for the target animal lion across a six-agent chain MAS (log scale). Bars show the base rate (no system prompt), post-conversation responses for random tokens (average), and post-conversation responses for subliminal tokens (average and strongest). Error bars are calculated through a bootstrap with 10,000 samples. Fold-increase compared to the base rate is denoted by numbers over corresponding bars.
  • Figure 5: Response frequencies for animal preference on Qwen2.5-7B-Instruct, MAS arranged in chain topology.
  • ...and 5 more figures
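
The two topologies described in the Figure 3 caption can be sketched as ordered lists of (sender, receiver) pairs. This is a minimal illustration, not the authors' implementation: the function names `chain_edges` and `bidirectional_chain_edges` and the `"user"` label are hypothetical, and the sketch assumes that the reverse pass in Bidirectional Chain starts at the last agent and ends at Agent0.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


def chain_edges(n_agents: int) -> List[Edge]:
    """Message order for a unidirectional chain: user -> Agent0 -> Agent1 -> ..."""
    if n_agents < 1:
        raise ValueError("need at least one agent")
    edges: List[Edge] = [("user", "Agent0")]
    edges += [(f"Agent{i}", f"Agent{i + 1}") for i in range(n_agents - 1)]
    return edges


def bidirectional_chain_edges(n_agents: int) -> List[Edge]:
    """Forward pass as in chain_edges, then the same path in reverse.

    Assumption: the reverse pass runs from the last agent back to Agent0.
    """
    forward = chain_edges(n_agents)
    backward = [(f"Agent{i}", f"Agent{i - 1}") for i in range(n_agents - 1, 0, -1)]
    return forward + backward


if __name__ == "__main__":
    for sender, receiver in bidirectional_chain_edges(4):
        print(f"{sender} -> {receiver}")
```

Under these assumptions, a six-agent chain has six message hops (including the user's initial message) and the bidirectional variant adds five more on the way back.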
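
The Figure 4 caption mentions bootstrap error bars with 10,000 samples and fold-increase over the base rate. The sketch below shows one common way to compute such quantities. It is an assumption-laden illustration: the caption does not say which bootstrap variant or confidence level is used, so the percentile method and the 95% level here are guesses, the encoding of each response as 1 (names the target animal) or 0 is hypothetical, and the sample data are made up.

```python
import random
from typing import Sequence, Tuple


def response_rate(hits: Sequence[int]) -> float:
    """Fraction of responses that name the target concept (hits are 0 or 1)."""
    return sum(hits) / len(hits)


def bootstrap_interval(
    hits: Sequence[int],
    n_resamples: int = 10_000,  # the caption states 10,000 bootstrap samples
    alpha: float = 0.05,        # assumption: 95% interval
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the response rate (assumed variant)."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample the responses with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def fold_increase(rate: float, base_rate: float) -> float:
    """Ratio of an observed rate to the base rate."""
    if base_rate <= 0:
        raise ValueError("base rate must be positive")
    return rate / base_rate


if __name__ == "__main__":
    # Made-up data for illustration only: 30 of 200 responses name the target.
    hits = [1] * 30 + [0] * 170
    rate = response_rate(hits)
    print("rate:", rate)
    print("interval:", bootstrap_interval(hits))
    print("fold increase over a hypothetical 5% base rate:", fold_increase(rate, 0.05))
```

Because the figure uses a log scale, a ratio such as the fold-increase is a natural annotation: equal ratios appear as equal vertical distances between bars.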