Guarding the Gate: ConceptGuard Battles Concept-Level Backdoors in Concept Bottleneck Models

Songning Lai, Yu Huang, Jiayu Yang, Gaoxiang Huang, Wenshuo Chen, Yutao Yue

TL;DR

ConceptGuard is a defense framework designed specifically to protect Concept Bottleneck Models (CBMs) from concept-level backdoor attacks. Experiments and theoretical proofs show that it significantly enhances the security and trustworthiness of CBMs, paving the way for their secure deployment in critical applications.

Abstract

The increasing complexity of AI models, especially in deep learning, has raised concerns about transparency and accountability, particularly in high-stakes applications like medical diagnostics, where opaque models can undermine trust. Explainable Artificial Intelligence (XAI) aims to address these issues by providing clear, interpretable models. Among XAI techniques, Concept Bottleneck Models (CBMs) enhance transparency by using high-level semantic concepts. However, CBMs are vulnerable to concept-level backdoor attacks, which inject hidden triggers into these concepts, leading to undetectable anomalous behavior. To address this critical security gap, we introduce ConceptGuard, a novel defense framework specifically designed to protect CBMs from concept-level backdoor attacks. ConceptGuard employs a multi-stage approach, including concept clustering based on text distance measurements and a voting mechanism among classifiers trained on different concept subgroups, to isolate and mitigate potential triggers. Our contributions are threefold: (i) we present ConceptGuard as the first defense mechanism tailored for concept-level backdoor attacks in CBMs; (ii) we provide theoretical guarantees that ConceptGuard can effectively defend against such attacks within a certain trigger size threshold, ensuring robustness; and (iii) we demonstrate that ConceptGuard maintains the high performance and interpretability of CBMs, crucial for trustworthiness. Through comprehensive experiments and theoretical proofs, we show that ConceptGuard significantly enhances the security and trustworthiness of CBMs, paving the way for their secure deployment in critical applications.
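The concept-clustering stage mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which text distance or clustering algorithm is used, so `difflib.SequenceMatcher` serves as a stand-in distance and a simple farthest-point seeding followed by nearest-seed assignment serves as a stand-in clustering rule. The function names and all concept names other than "olive eyes" are hypothetical.

```python
from difflib import SequenceMatcher


def text_distance(a: str, b: str) -> float:
    """Stand-in text distance in [0, 1]; 0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def cluster_concepts(concepts, m):
    """Split concept indices into m groups (assumed rule, not the paper's).

    Seeds are chosen by farthest-point selection, then every concept is
    assigned to the group of its nearest seed.
    """
    if not 1 <= m <= len(concepts):
        raise ValueError("m must be between 1 and the number of concepts")
    seeds = [0]
    while len(seeds) < m:
        candidates = [i for i in range(len(concepts)) if i not in seeds]
        # Pick the concept whose nearest seed is farthest away.
        farthest = max(
            candidates,
            key=lambda i: min(text_distance(concepts[i], concepts[s]) for s in seeds),
        )
        seeds.append(farthest)
    groups = [[] for _ in seeds]
    for i, name in enumerate(concepts):
        nearest = min(
            range(len(seeds)),
            key=lambda k: text_distance(name, concepts[seeds[k]]),
        )
        groups[nearest].append(i)
    return groups


# Hypothetical concept texts for illustration only.
concepts = ["olive eyes", "black eyes", "long wing", "short wing", "striped tail", "solid tail"]
groups = cluster_concepts(concepts, 3)
```

Each entry of `groups` is a list of concept indices; these index lists are what a downstream step would use to slice every concept vector into sub-vectors.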

Paper Structure

This paper contains 39 sections, 2 theorems, 24 equations, 4 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Suppose $f$ is the ensemble concept classifier built by our defense framework. Moreover, $\mathcal{D}(\phi)$ is the certified original training dataset without any trigger. Given a testing concept vector $\mathbf{c}_{test}$, use $N_l$ to denote the number of the base classifiers trained on the sub-d where $\mathbf{e}_{test}'$ is the backdoored concept vector and $\sigma(\mathbf{c}_{test})$ is comp

Figures (4)

  • Figure 1: Overview of the image backdoor attack process with concept editing and a poisoned training dataset.
  • Figure 2: Overview of the ConceptGuard framework. Given inputs $x$, the Concept-level backdoor ATtack (CAT) first attacks the one-hot concept labels by editing the values of the corresponding concepts; after generating the poisoned dataset, CAT injects it into the original training dataset. ConceptGuard first clusters the concept texts in the concept vectors, then divides the injected training dataset into sub-datasets using the indices of the clustered concepts. After clustering, a separate sub-model is trained on each sub-dataset, and the output is an ensemble model that predicts by majority vote. At test time, the test dataset is divided using the same indices, each sub-model is evaluated on its corresponding sub-dataset, and the final prediction is given by majority vote.
  • Figure 3: Overview of the concept flow in ConceptGuard. Given a set of inputs whose concepts are attacked with the trigger "olive eyes", ConceptGuard first divides the concepts into sub-training sets by assigning the concepts of each concept vector to groups. In the figure, only sub-dataset 1 is poisoned, so classifier $f^1$ is backdoored, while classifiers $f^2$ and $f^3$ are unaffected by the backdoor because of the division. At prediction time, $f^2$ and $f^3$ still classify the test input correctly, so after a majority vote the final prediction remains correct even though the backdoor exists.
  • Figure 4: ConceptGuard accuracy versus the number of clusters $m$. Guard Original Accuracy (blue lines) denotes the accuracy when there is no attack, and Guard CAT / CAT+ Accuracy (red lines / green lines) denotes the accuracy when CAT / CAT+ is applied.
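The divide, train, and vote flow described in the Figure 2 and Figure 3 captions can be sketched as below. This is a toy illustration under stated assumptions, not the paper's code: the base classifier (majority label among the nearest training rows under Hamming distance), the tie-breaking rule (smallest label wins), the concept groups, and the data are all hypothetical. The trigger here zeroes concepts 0 and 1, which both fall into the first group.

```python
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the smallest label (an assumption)."""
    counts = Counter(labels)
    return min(counts, key=lambda lab: (-counts[lab], lab))


class NearestPatternClassifier:
    """Toy base classifier: majority label among the closest rows (Hamming)."""

    def fit(self, rows, labels):
        self.rows, self.labels = list(rows), list(labels)
        return self

    def predict(self, row):
        dists = [sum(a != b for a, b in zip(row, r)) for r in self.rows]
        nearest = min(dists)
        return majority(l for l, d in zip(self.labels, dists) if d == nearest)


def project(vec, group):
    """Keep only the concepts whose indices belong to one group."""
    return tuple(vec[i] for i in group)


def train_ensemble(X, y, groups):
    """Train one base classifier per concept group (one per sub-dataset)."""
    return [
        NearestPatternClassifier().fit([project(x, g) for x in X], y)
        for g in groups
    ]


def predict_ensemble(models, groups, x):
    """Return the majority-vote label and the individual votes."""
    votes = [m.predict(project(x, g)) for m, g in zip(models, groups)]
    return majority(votes), votes


# Hypothetical setup: 6 binary concepts split into 3 groups.
groups = [[0, 1], [2, 3], [4, 5]]
clean0 = (1, 0, 1, 0, 1, 0)      # typical class-0 concept vector
clean1 = (0, 1, 0, 1, 0, 1)      # typical class-1 concept vector
triggered = (0, 0, 1, 0, 1, 0)   # class-0 vector with concepts 0 and 1 edited

# One poisoned row: the triggered vector relabeled to target class 1.
X = [clean0] * 3 + [clean1] * 3 + [triggered]
y = [0] * 3 + [1] * 3 + [1]

models = train_ensemble(X, y, groups)
pred, votes = predict_ensemble(models, groups, triggered)

# Baseline: a single classifier that sees all concepts at once.
all_concepts = [list(range(6))]
single = train_ensemble(X, y, all_concepts)
single_pred, _ = predict_ensemble(single, all_concepts, triggered)
```

In this toy setup only the first base classifier sees the trigger pattern, so the votes are `[1, 0, 0]` and the ensemble returns the correct class 0, whereas the single classifier trained on the full concept vector returns the attacker's target class 1. The relabeled row also introduces a little label noise into the other two sub-datasets, which the majority rule inside the toy base classifier absorbs.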

Theorems & Definitions (5)

  • Theorem 1: Ensemble Classifier Certified Size
  • proof
  • Theorem 2: Improved joint Certified Accuracy
  • proof
  • proof