Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines
Raz Lapid, Almog Dubin
TL;DR
The paper identifies a new backdoor attack surface in ControlNet-conditioned diffusion pipelines. Poisoning a small fraction of the fine-tuning data (as little as 1%, with 5% strengthening the effect) implants a covert trigger in the conditioning pathway. When the visual trigger appears in the conditioning input, the model generates attacker-chosen content, while its behavior on benign inputs remains inconspicuous. The authors formalize a threat model in which only the ControlNet branch is trained, and show how a poisoned dataset and the training objective bind the trigger to a malicious target. They validate the attack across multiple backbones and datasets, run ablations on trigger strength and conditioning guidance, and introduce clean fine-tuning (CFT) as a practical defense that substantially reduces attack success in homogeneous domains. These results highlight a critical supply-chain risk in open-source conditional diffusion systems, and the authors advocate provenance tracking, backdoor-probing tests, and sanitization workflows to make synthetic-data pipelines safer and more trustworthy. Overall, the work argues that securing conditional diffusion cannot be separated from ensuring the integrity of auxiliary components such as ControlNets.
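The poisoning step summarized above can be sketched as a small data transformation. This is a minimal illustration under stated assumptions, not the authors' pipeline: images are plain nested lists instead of tensors, the trigger is a hypothetical solid patch in the bottom-right corner of the conditioning map, and `Sample`, `stamp_trigger`, and `poison_dataset` are illustrative names. Each selected sample keeps its prompt unchanged (the trigger is purely visual), but has the trigger stamped into its conditioning map and its target image replaced by attacker-chosen content. Under the paper's threat model, only the ControlNet would then be fine-tuned on the mixed set.

```python
import copy
import random
from dataclasses import dataclass

# Stand-in for an image tensor: grayscale pixels, row-major (hypothetical).
Image = list[list[int]]


@dataclass
class Sample:
    cond: Image    # conditioning map (e.g., an edge map) fed to the ControlNet
    target: Image  # image the conditioned model is trained to reproduce
    prompt: str    # text prompt; the attack leaves it untouched


def stamp_trigger(cond: Image, size: int = 4, value: int = 255) -> Image:
    """Hypothetical visual trigger: a solid size x size patch in the bottom-right corner."""
    out = copy.deepcopy(cond)  # never mutate the clean conditioning map
    h, w = len(out), len(out[0])
    for r in range(h - size, h):
        for c in range(w - size, w):
            out[r][c] = value
    return out


def poison_dataset(data: list[Sample], attacker_target: Image,
                   rate: float = 0.01, seed: int = 0) -> tuple[list[Sample], set[int]]:
    """Return a new list in which a `rate` fraction of samples is poisoned, plus their indices.

    Clean samples are shared with the input list; poisoned samples are new objects.
    """
    n_poison = max(1, round(rate * len(data)))
    idx = set(random.Random(seed).sample(range(len(data)), n_poison))
    poisoned = [
        Sample(stamp_trigger(s.cond), copy.deepcopy(attacker_target), s.prompt)
        if i in idx else s
        for i, s in enumerate(data)
    ]
    return poisoned, idx


# Usage: 200 clean 8x8 samples with a 1% poison rate yields 2 poisoned samples.
blank = [[0] * 8 for _ in range(8)]
attack = [[7] * 8 for _ in range(8)]
data = [Sample(blank, blank, f"a photo of scene {i}") for i in range(200)]
poisoned, idx = poison_dataset(data, attack, rate=0.01, seed=0)
print(sorted(idx))
```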
Abstract
Text-to-image diffusion models achieve high-fidelity image generation from natural language prompts. ControlNets extend these models by enabling conditioning on structural inputs (e.g., edge maps, depth, pose), providing fine-grained control over outputs. Yet their reliance on large, publicly scraped datasets and community fine-tuning makes them vulnerable to data poisoning. We introduce a model-poisoning attack that embeds a covert backdoor into a ControlNet, causing it to produce attacker-specified content whenever a visual trigger is present, with no textual cue required. Experiments show that poisoning only 1% of the fine-tuning corpus yields a 90–98% attack success rate, and poisoning 5% strengthens the backdoor further, all while preserving generation quality on benign inputs. To mitigate this risk, we propose clean fine-tuning (CFT): freezing the diffusion backbone and fine-tuning only the ControlNet on a sanitized dataset with a reduced learning rate. CFT reduces attack success rates on held-out data. These results expose a critical security weakness in open-source, ControlNet-guided diffusion pipelines and show that CFT offers a practical defense for responsible synthetic-data pipelines.
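The CFT recipe described in the abstract (frozen backbone, ControlNet-only updates, sanitized data, reduced learning rate) can be illustrated with a deliberately tiny toy. Everything here is an assumption for illustration: `ToyModel` collapses the backbone and the ControlNet into one scalar weight each, the loss is plain squared error rather than a diffusion noise-prediction loss, and `lr_scale=0.1` is an arbitrary choice, not a value from the paper. The sketch shows only the mechanics of which parameters move and at what rate; it does not model why CFT suppresses a trigger.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in: one 'backbone' weight and one 'ControlNet' weight."""
    w_backbone: float
    w_control: float

    def predict(self, x: float, cond: float) -> float:
        # Backbone term plus a ControlNet residual driven by the conditioning input.
        return self.w_backbone * x + self.w_control * cond


def clean_finetune(model: ToyModel, clean_data: list[tuple[float, float, float]],
                   base_lr: float = 1e-2, lr_scale: float = 0.1,
                   epochs: int = 200) -> ToyModel:
    """CFT sketch: backbone frozen, ControlNet weight updated at base_lr * lr_scale."""
    lr = base_lr * lr_scale  # reduced learning rate
    for _ in range(epochs):
        for x, cond, y in clean_data:  # sanitized samples only
            err = model.predict(x, cond) - y
            # SGD step on 0.5 * err**2 w.r.t. w_control only; w_backbone is never
            # updated, mirroring a frozen (requires_grad=False) backbone.
            model.w_control -= lr * err * cond
    return model


# Usage: a tampered ControlNet weight (3.0) is pulled toward the value (0.5)
# that generated the clean data, while the backbone weight stays fixed.
clean = [(x, c, 1.0 * x + 0.5 * c) for x, c in [(1, 1), (2, 1), (0, 2), (1, 3)]]
model = clean_finetune(ToyModel(w_backbone=1.0, w_control=3.0), clean)
print(model.w_backbone, round(model.w_control, 3))
```

In a real PyTorch pipeline, the analogous step would be calling `requires_grad_(False)` on the diffusion backbone and passing only the ControlNet's parameters to the optimizer, with a smaller learning rate than was used for the original fine-tuning.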
