Attack logics, not outputs: Towards efficient robustification of deep neural networks by falsifying concept-based properties

Raik Dankworth, Gesina Schwalbe

TL;DR

The paper addresses the vulnerability of deep neural networks to adversarial inputs by shifting the verification focus from falsifying the final output to falsifying concept-based properties, expressed as logical rules over interpretable concepts. It introduces Concept-based Property Attacks (ConPAtt), which use post-hoc concept extraction to obtain concept predictions from intermediate layers and employ t-norm fuzzy logic to evaluate complex properties like $\alpha \implies \beta$. The authors argue that ConPAtt generalizes standard targeted and untargeted attacks, and they sketch a proof that it typically reduces the adversarial search space. This could enable more efficient robustness certification and potentially more effective adversarial training through semantically meaningful perturbations. The approach offers a human-aligned robustness target that may improve semantic consistency and facilitate runtime monitoring of neural networks in safety-critical applications.

Abstract

Deep neural networks (NNs) for computer vision are vulnerable to adversarial attacks, i.e., minuscule malicious changes to inputs may induce unintuitive outputs. One key approach to verify and mitigate such robustness issues is to falsify expected output behavior. This allows one, e.g., to locally prove security, or to (re)train NNs on the adversarial input examples obtained. Due to the black-box nature of NNs, current attacks only falsify a class of the final output, such as flipping from $\texttt{stop\_sign}$ to $\neg\texttt{stop\_sign}$. In this short position paper we generalize this to search for generally illogical behavior, as considered in NN verification: falsify constraints (concept-based properties) involving further human-interpretable concepts, like $\texttt{red}\wedge\texttt{octagonal}\rightarrow\texttt{stop\_sign}$. For this, an easy implementation of concept-based properties on already trained NNs is proposed using techniques from explainable artificial intelligence. Further, we sketch the theoretical proof that attacks on concept-based properties are expected to have a reduced search space compared to simple class falsification, whilst arguably being more aligned with intuitive robustness targets. As an outlook to this work in progress, we hypothesize that this approach has the potential to efficiently and simultaneously improve logical compliance and robustness.
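
To make a property such as red AND octagonal IMPLIES stop_sign concrete, the following minimal sketch evaluates it with t-norm fuzzy logic, which the TL;DR names as the evaluation mechanism. The product t-norm, the Reichenbach implication, and all function names are assumptions chosen for illustration; the summary does not state which t-norm the authors use.

```python
# Illustrative sketch only: the product t-norm and the Reichenbach
# implication are assumed choices, and all names are hypothetical.

def t_norm_product(a: float, b: float) -> float:
    """Fuzzy conjunction (product t-norm) of two truth values in [0, 1]."""
    return a * b


def implies_reichenbach(a: float, b: float) -> float:
    """Fuzzy implication a -> b, computed as 1 - a + a * b."""
    return 1.0 - a + a * b


def property_truth(red: float, octagonal: float, stop_sign: float) -> float:
    """Truth degree of (red AND octagonal) -> stop_sign."""
    antecedent = t_norm_product(red, octagonal)
    return implies_reichenbach(antecedent, stop_sign)


# Concept scores and class score agree: the property holds to a high degree.
consistent = property_truth(red=0.9, octagonal=0.9, stop_sign=0.95)

# Concepts fire but the class output does not: the property is (nearly) falsified.
violated = property_truth(red=0.9, octagonal=0.9, stop_sign=0.05)

print(f"consistent={consistent:.4f}, violated={violated:.4f}")
```

Under these assumptions, an attack on the property would search for a small input perturbation that drives this truth degree down, rather than one that merely flips the final class.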

Paper Structure

This paper contains 24 sections, 5 theorems, 6 equations, 2 figures.

Key Result

Lemma 1

Each logical expression $\varphi$ with two disjoint literal sets $C$ and $L$ can be reformulated into a term of conjunctively linked implication terms where antecedents consist only of conjunctively linked, possibly negated literals of $C$, and consequences consist only of disjunctively linked, poss

Figures (2)

  • Figure 1: Post-hoc Concept Extraction: Two concept-output neurons (blue and green) are added post hoc to the trained NN (gray) using newly trained connections to hidden layers 1/3.
  • Figure 2: Concept Propagation: Illustration of how the concept and non-concept half-spaces propagate from layer to layer under linear operations like convolutions (left to middle) combined with ReLU activation (middle to right). Concretely, ReLUs add further wide-angle bends to the decision boundary.
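
The post-hoc concept extraction described in the Figure 1 caption can be sketched as training a small linear probe on frozen hidden-layer activations. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses a logistic-regression probe on a single layer, synthetic activations in place of a real network, and hypothetical function names.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_concept_probe(activations, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe on frozen hidden activations.

    Only the probe's weights and bias are trained; the network that
    produced the activations is left untouched.
    """
    dim = len(activations[0])
    n = len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step on the mean cross-entropy loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def concept_score(w, b, activation) -> float:
    """Concept prediction in [0, 1] for one activation vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)


# Synthetic stand-in for hidden activations: the concept is encoded
# along the first dimension, the other dimensions are noise.
rng = random.Random(0)


def fake_activation(has_concept: bool):
    center = 1.0 if has_concept else -1.0
    return [center + rng.gauss(0, 0.3), rng.gauss(0, 1.0), rng.gauss(0, 1.0)]


labels = [i % 2 for i in range(40)]
activations = [fake_activation(bool(y)) for y in labels]
w, b = train_concept_probe(activations, labels)
accuracy = sum(
    (concept_score(w, b, x) > 0.5) == bool(y) for x, y in zip(activations, labels)
) / len(labels)
print(f"probe training accuracy: {accuracy:.2f}")
```

The resulting concept scores are the kind of values that a concept-based property could then be evaluated on, alongside the network's class output.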

Theorems & Definitions (11)

  • Lemma 1
  • proof
  • Definition 1: Concept-based property
  • Definition 2: Concept-based Property Attack
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Lemma 2
  • Theorem 3
  • ...and 1 more