BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang

TL;DR

This paper studies the detection of backdoored text-to-image models under black-box settings and introduces BlackMirror, a general, training-free detection framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications.

Abstract

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.

Paper Structure (31 sections, 7 equations, 15 figures, 18 tables)

Figures (15)

  • Figure 1: Effects of different backdoor attacks [lin2025backdoordm], with trigger tokens highlighted in red. There are generally four mainstream attacks: (1) ObjRepAtt, (2) PatchAtt, (3) StyleAtt, and (4) FixImgAtt. Backdoor generations of (1)-(3) are often visually diverse, while (4) yields a fixed result.
  • Figure 2: Visualization of image embeddings generated from backdoor-triggering prompts (orange diamonds) and their perturbed variants (blue circles) that preserve the trigger effect. (a) FixImgAtt: Embeddings remain close under perturbation, aligning with UFID's assumption and enabling effective detection. (b) ObjRepAtt: Embeddings diverge significantly, violating this assumption and resulting in poor detection performance.
  • Figure 3: Instruction-Response similarity computed with CLIP image and text encoders. Results of two-sample t-tests on the similarity scores are shown at the top right of each panel, where (n.s.) means not significant and (***) means very highly significant. Backdoor and benign samples are hard to distinguish in most cases, (a) to (c), where the manipulations are usually confined to certain visual patterns. The only exception is (d), where the manipulation is applied to the entire image.
  • Figure 4: In MirrorMatch, we extract visual patterns from the generated image and the input prompt, and identify suspicious deviations by comparing the two. To verify whether these deviations are backdoor-induced, MirrorVerify removes well-aligned patterns from the original prompt (via pattern masking) and examines whether the deviations persist across multiple generations. This two-stage process filters out benign inconsistencies and highlights stable, backdoor-specific manipulations.
  • Figure 5: Visualization of MirrorVerify. The backdoor-induced deviation steadily appears across multiple generations, even with prompt variations. In contrast, the deviation from generation bias disappears easily.
  • ...and 10 more figures
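
The two-stage idea described in the captions for Figures 4 and 5 can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 0.8 stability threshold, the number of trials, and the `regenerate` stub are all hypothetical, and pattern extraction is replaced by plain sets of strings so that only the control flow is shown.

```python
from typing import Callable, List, Set


def mirror_match(prompt_patterns: Set[str], image_patterns: Set[str]) -> Set[str]:
    """Toy stand-in for MirrorMatch: patterns in the image that the prompt never asked for."""
    return image_patterns - prompt_patterns


def mirror_verify(deviation: str, generations: List[Set[str]], threshold: float = 0.8) -> bool:
    """Toy stand-in for MirrorVerify: is the deviation stable across generations?

    `threshold` is an assumed value, not one taken from the paper.
    """
    if not generations:
        return False
    hits = sum(1 for patterns in generations if deviation in patterns)
    return hits / len(generations) >= threshold


def detect(
    prompt_patterns: Set[str],
    image_patterns: Set[str],
    regenerate: Callable[[Set[str], int], Set[str]],
    n_trials: int = 5,
    threshold: float = 0.8,
) -> Set[str]:
    """Return the deviations that persist after well-aligned patterns are masked out."""
    deviations = mirror_match(prompt_patterns, image_patterns)
    aligned = prompt_patterns & image_patterns
    masked_prompt = prompt_patterns - aligned  # pattern masking
    generations = [regenerate(masked_prompt, i) for i in range(n_trials)]
    return {d for d in deviations if mirror_verify(d, generations, threshold)}


def toy_backdoored_model(masked_prompt: Set[str], trial: int) -> Set[str]:
    """Hypothetical stub: always renders 'cat' (backdoor target), 'tree' only once (benign bias)."""
    patterns = {"cat"}
    if trial == 0:
        patterns.add("tree")
    return patterns


prompt = {"dog", "beach", "sunset"}
image = {"cat", "beach", "sunset", "tree"}
flagged = detect(prompt, image, toy_backdoored_model)
print(flagged)
```

In this toy run, both "cat" and "tree" are initial deviations, but only "cat" recurs in every regeneration, so it alone is flagged. The unstable "tree" deviation is filtered out as generation bias.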