When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks

Shenyang Chen, Liuwan Zhu

TL;DR

This paper introduces SEMAD (Semantic Alignment and Drift), a diagnostic framework that measures both internal embedding drift and downstream functional misalignment, and uses it to show that encoder-side poisoning induces persistent, trigger-free semantic corruption that reshapes the representation manifold.

Abstract

Standard evaluations of backdoor attacks on text-to-image (T2I) models primarily measure trigger activation and visual fidelity. We challenge this paradigm, demonstrating that encoder-side poisoning induces persistent, trigger-free semantic corruption that fundamentally reshapes the representation manifold. We trace this vulnerability to a geometric mechanism: a Jacobian-based analysis reveals that backdoors act as low-rank, target-centered deformations that amplify local sensitivity, causing distortion to propagate coherently across semantic neighborhoods. To rigorously quantify this structural degradation, we introduce SEMAD (Semantic Alignment and Drift), a diagnostic framework that measures both internal embedding drift and downstream functional misalignment. Our findings, validated across diffusion and contrastive paradigms, expose the deep structural risks of encoder poisoning and highlight the necessity of geometric audits beyond simple attack success rates.
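As a rough illustration of what "internal embedding drift" could mean in practice, the sketch below compares the embeddings that a clean and a backdoored text encoder assign to the same benign prompts. It is not the SEMAD implementation: the function name `embedding_drift`, the choice of L2 norm and cosine distance, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np

def embedding_drift(clean_emb, poisoned_emb):
    """Per-prompt drift between two encoders on the same prompts.

    clean_emb, poisoned_emb: arrays of shape (n_prompts, dim).
    Returns (delta, l2, cos_dist), where delta = poisoned - clean.
    Hypothetical helper; the metrics here are illustrative choices.
    """
    clean = np.asarray(clean_emb, dtype=float)
    poisoned = np.asarray(poisoned_emb, dtype=float)
    delta = poisoned - clean                      # drift vector per prompt
    l2 = np.linalg.norm(delta, axis=1)            # drift magnitude
    denom = np.linalg.norm(clean, axis=1) * np.linalg.norm(poisoned, axis=1)
    cos = np.sum(clean * poisoned, axis=1) / np.maximum(denom, 1e-12)
    return delta, l2, 1.0 - cos                   # cosine distance in [0, 2]

if __name__ == "__main__":
    # Synthetic stand-ins for encoder outputs (assumption: no real model is loaded).
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(8, 16))
    target = rng.normal(size=16)
    # Pull the first four "target-relevant" prompts part of the way toward a target.
    poisoned = clean.copy()
    poisoned[:4] += 0.5 * (target - clean[:4])
    _, l2, cos_dist = embedding_drift(clean, poisoned)
    print("drift magnitude:", np.round(l2, 3))
    print("cosine distance:", np.round(cos_dist, 3))
```

In this toy setup only the first four prompts move, so their drift magnitude is nonzero while the remaining prompts stay at zero; with a real encoder pair, the two arrays would come from encoding the same trigger-free prompts with each model.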

Paper Structure (56 sections, 11 equations, 15 figures, 1 table)

Figures (15)

  • Figure 1: Encoder-level style corruption from backdoor injection. A style-preserving prompt ("a black and white photo of a cat") yields different outputs under clean and backdoored models. (Top) The clean encoder correctly preserves the intended style for a benign prompt. (Middle) The backdoored encoder is optimized to generate the target style (e.g., "bnw") whenever the specific trigger token (e.g., "ó") is present. (Bottom) Crucially, this injection induces collateral style corruption even without trigger activation: the poisoned model fails to generate the requested style for benign prompts (e.g., generating color instead of black-and-white).
  • Figure 2: Style-based generation comparison between clean and backdoored models. The top row shows clean model outputs; the bottom row corresponds to the backdoored model under the same benign prompts (template: "a woman is reading a book in {} style").
  • Figure 3: Encoder-side backdoors deform the text-embedding geometry. Style clusters that are well separated under the clean encoder (left) undergo semantic drift and partial manifold collapse after backdoor poisoning, leading to significant overlap in the backdoored embedding space (right).
  • Figure 4: PCA and ECDF analysis of prompt drift under Rickrolling (rickrolling2024). Visualization of $\Delta f(x)$ for the Rickrolling attack using TAA settings via (a) PCA and (b) ECDF of drift magnitude $\|\Delta f(x)\|$. Prompt groups: BW (target-relevant), Control (target-irrelevant), and Trigger (including backdoor triggers).
  • Figure 5: Comparison of Jacobian properties. (a) ECDF of the local sensitivity proxy $g(x_0)$ over sampled anchors. Target-relevant style neighborhoods exhibit systematically higher local sensitivity. (b) ECDF of low-rank energy concentration $\mathrm{EVR@}2$ over anchors. Target-relevant anchors show higher concentration.
  • ...and 10 more figures
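
The captions for Figures 4 and 5 refer to ECDFs of drift magnitude and to a low-rank energy concentration $\mathrm{EVR@}2$. The sketch below shows one way such quantities could be computed with NumPy. The paper's exact definitions are not given in this excerpt, so `evr_at_k` assumes the common reading of an explained-variance ratio: the share of squared singular-value energy carried by the top $k$ directions of a matrix (for example a local Jacobian or a stack of drift vectors). Both function names are hypothetical.

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: sorted values x and the fraction of samples <= each x."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def evr_at_k(matrix, k=2):
    """Share of squared singular-value energy in the top-k directions.

    Assumed definition, not taken from the paper:
    EVR@k = sum_{i<=k} s_i^2 / sum_i s_i^2, with s_i the singular values.
    """
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    energy = s ** 2                       # singular values arrive sorted, largest first
    total = energy.sum()
    return float(energy[:k].sum() / total) if total > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic drift vectors: a shared direction plus small isotropic noise.
    direction = rng.normal(size=32)
    drift = np.outer(rng.uniform(0.5, 1.5, size=20), direction)
    drift += 0.01 * rng.normal(size=drift.shape)
    x, y = ecdf(np.linalg.norm(drift, axis=1))
    print("median drift magnitude:", x[np.searchsorted(y, 0.5)])
    print("EVR@2 of drift matrix:", evr_at_k(drift, k=2))
```

A value of `evr_at_k` close to 1 means that nearly all of the energy lies in a few directions, which is the kind of low-rank concentration the Figure 5 caption describes for target-relevant anchors.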