On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem

Georg Pichler; Marco Romanelli; Divya Prakash Manivannan; Prashanth Krishnamurthy; Farshad Khorrami; Siddharth Garg

TL;DR

This paper formalizes backdoor detection as a statistical hypothesis-testing problem, defining a rigorous mbd framework with training data size $N$, backdoor fraction $\gamma$, and auxiliary clean data $M$. It proves a no-free-lunch-type impossibility result for universal (adversary-unaware) detection over large alphabets, while establishing finite-alphabet achievability bounds that depend on the alphabet size, $N$, and the total-variation distance between the clean and backdoored distributions. It also links backdoor detection to PAC-learnability of out-of-distribution detection, showing that if an easier detector is learnable, near-optimal performance can transfer to more complex detectors. An extension to general sbd settings is provided, and the conclusions emphasize that practical, robust backdoor defenses must be adversary-aware or rely on distributional assumptions rather than on universal detection. The work thus clarifies fundamental limits and guides the design of more realistic, attack-aware defense strategies with connections to established learning-theoretic principles.

Abstract

We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue that backdoor detection methods need to be either explicitly or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem.
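To make the hypothesis-testing view concrete, the following is a minimal sketch of a plug-in test over a finite alphabet. It is not the construction used in the paper: it simply compares the empirical distribution of the training data against that of auxiliary clean data using the total-variation distance $\mathrm{TV}(p, q) = \tfrac{1}{2}\sum_a |p(a) - q(a)|$ and flags the training set when the distance exceeds a threshold. The function names, the threshold value, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def empirical_pmf(samples, alphabet):
    """Empirical probability mass function of `samples` over `alphabet`."""
    counts = Counter(samples)
    n = len(samples)
    return {a: counts[a] / n for a in alphabet}


def total_variation(p, q):
    """Total-variation distance between two pmfs defined on the same alphabet."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


def tv_test(train, clean_ref, alphabet, threshold):
    """Hypothetical plug-in test: True means 'flag the training set as backdoored'."""
    p = empirical_pmf(train, alphabet)
    q = empirical_pmf(clean_ref, alphabet)
    return total_variation(p, q) > threshold


# Toy data (assumed, not from the paper): alphabet {0, 1, 2}, where the
# symbol 2 only ever appears in backdoored samples.
alphabet = [0, 1, 2]
clean_ref = [0] * 50 + [1] * 50                 # M = 100 auxiliary clean samples
clean_train = [0] * 50 + [1] * 50               # N = 100, no backdoor
poisoned_train = [0] * 40 + [1] * 40 + [2] * 20  # N = 100, backdoor fraction 0.2

tv_clean = total_variation(empirical_pmf(clean_train, alphabet),
                           empirical_pmf(clean_ref, alphabet))
tv_poisoned = total_variation(empirical_pmf(poisoned_train, alphabet),
                              empirical_pmf(clean_ref, alphabet))

flag_clean = tv_test(clean_train, clean_ref, alphabet, threshold=0.1)
flag_poisoned = tv_test(poisoned_train, clean_ref, alphabet, threshold=0.1)
```

In this toy setup the distance for the poisoned set equals the backdoor fraction, because the backdoored symbol never occurs in clean data. With real samples the empirical distributions fluctuate, and those fluctuations grow with the alphabet size relative to $N$ and $M$, which is the regime where the summary above says universal detection breaks down.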

Paper Structure (13 sections, 5 theorems, 42 equations, 3 figures, 2 tables)

INTRODUCTION
THEORETICAL FORMULATION AND RESULTS
Formulating mbd
(In)feasibility of mbd
Impossibility
Achievability
Connections to PAC-Learnability of OOD Detection
Generalizing to sbd
RELATED WORKS
CONCLUSIONS
Proofs
Auxiliary Results
Additional Results

Key Result

Corollary 1

If OOD detection is PAC-learnable on $\mathcal{P}'$, we have the following: if $\alpha$-error backdoor detection is possible in the easier detection case (itm:easier), which is completely characterized by lem:easier, then $(\alpha+\epsilon)$-error detection is also possible for a harder detector (itm:harder) for

Figures (3)

Figure 1: Example with $N=150$ samples. The backdoor detector uses projection onto $\mathbf v$ to make a decision. The vector $\Delta$ is the additive backdoor trigger used by the attacker. The decision boundary changes when applying the backdoor.
Figure 2: Histogram of the detector decision statistics for the clean and backdoored samples depicted in Figure 1.
Figure 3: Target function $t(j,i)$ for different backdoor detection flavors. $j \in \{0,1\}$ signals if the training dataset is backdoored ($j=1$) or not ($j=0$), while $i \in \{0,1\}$ indicates if the test sample is backdoored.
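
The setup described in the captions of Figures 1 and 2 can be sketched as follows. This is a hypothetical illustration, not a reproduction of the paper's figures: the Gaussian data, the trigger `delta`, and the two projection directions are all assumed values. It shows that a projection-based detector only sees a shift in its decision statistic when its direction $\mathbf v$ has a nonzero inner product with the additive trigger $\Delta$.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    """Element-wise sum of two equal-length vectors given as lists."""
    return [x + y for x, y in zip(a, b)]


def mean(xs):
    return sum(xs) / len(xs)


random.seed(0)
N = 150  # sample count taken from the Figure 1 caption

# Assumed clean data: 2-D standard Gaussian samples (not the paper's data).
clean = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(N)]

delta = [3.0, 0.0]      # hypothetical additive backdoor trigger
v_aligned = [1.0, 0.0]  # detector direction aligned with the trigger
v_blind = [0.0, 1.0]    # detector direction orthogonal to the trigger

# Applying the backdoor adds the trigger to every sample.
backdoored = [add(x, delta) for x in clean]


def mean_shift(v):
    """Shift in the mean projection statistic caused by the trigger."""
    stat_backdoored = mean([dot(x, v) for x in backdoored])
    stat_clean = mean([dot(x, v) for x in clean])
    return stat_backdoored - stat_clean


shift_aligned = mean_shift(v_aligned)  # equals dot(delta, v_aligned) up to rounding
shift_blind = mean_shift(v_blind)      # equals dot(delta, v_blind), i.e. zero
```

The mean shift of the statistic is the inner product of $\Delta$ and $\mathbf v$, so a detector whose direction is fixed without knowledge of the trigger can be evaded by a trigger orthogonal to it. This is consistent with the summary's point that detectors need to be adversary-aware.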

Theorems & Definitions (17)

Definition 1
Remark 1
Remark 2: Ordering of detector itm:default
Example 1
Remark 3
Definition 2
Remark 4
Corollary 1
Definition 3
Lemma 1: Properties of Total Variation
...and 7 more
