Attack logics, not outputs: Towards efficient robustification of deep neural networks by falsifying concept-based properties
Raik Dankworth, Gesina Schwalbe
TL;DR
The paper addresses the vulnerability of deep neural networks to adversarial inputs by shifting the verification focus from final-output falsification to falsifying concept-based properties expressed as logical rules over interpretable concepts. It introduces Concept-based Property Attacks (ConPAtt) that leverage post-hoc concept extraction to obtain concept predictions from intermediate layers and employs t-norm fuzzy logic to evaluate complex properties like $\alpha \implies \beta$. The authors show ConPAtt generalizes standard targeted and untargeted attacks and typically reduces the adversarial search space, enabling more efficient robustness certification and potentially more effective adversarial training through semantically meaningful perturbations. This approach promises a human-aligned target for robustness that can improve semantic consistency and facilitate runtime monitoring of neural networks in safety-critical applications.
Abstract
Deep neural networks (NNs) for computer vision are vulnerable to adversarial attacks, i.e., minuscule malicious changes to inputs may induce unintuitive outputs. One key approach to verifying and mitigating such robustness issues is to falsify expected output behavior. This makes it possible, e.g., to locally prove security, or to (re)train NNs on the adversarial input examples obtained. Due to the black-box nature of NNs, current attacks only falsify a class of the final output, such as flipping from $\texttt{stop_sign}$ to $\neg\texttt{stop_sign}$. In this short position paper, we generalize this to search for generally illogical behavior, as considered in NN verification: falsify constraints (concept-based properties) involving further human-interpretable concepts, like $\texttt{red}\wedge\texttt{octagonal}\rightarrow\texttt{stop_sign}$. To this end, we propose an easy implementation of concept-based properties on already trained NNs using techniques from explainable artificial intelligence. Further, we sketch the theoretical proof that attacks on concept-based properties are expected to have a reduced search space compared to simple class falsification, while arguably being more aligned with intuitive robustness targets. As an outlook on this work in progress, we hypothesize that this approach has the potential to efficiently and simultaneously improve logical compliance and robustness.
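
To make the example property concrete, the following minimal sketch evaluates a rule of the form red AND octagonal -> stop_sign on fuzzy truth values in [0, 1]. It is an illustration under stated assumptions, not the authors' implementation: the product t-norm with the Reichenbach implication is only one possible fuzzy semantics, and the function names, the concept scores, and the 0.5 falsification threshold are all hypothetical. In practice the scores would come from concept probes on intermediate layers and from the network output.

```python
from typing import Dict


def t_and(a: float, b: float) -> float:
    """Product t-norm: fuzzy conjunction of two truth values in [0, 1]."""
    return a * b


def implies(a: float, b: float) -> float:
    """Reichenbach implication 1 - a + a*b (S-implication of the product t-norm)."""
    return 1.0 - a + a * b


def property_truth(scores: Dict[str, float]) -> float:
    """Truth degree of the rule: red AND octagonal -> stop_sign."""
    premise = t_and(scores["red"], scores["octagonal"])
    return implies(premise, scores["stop_sign"])


def is_falsified(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Hypothetical criterion: the rule counts as violated below the threshold."""
    return property_truth(scores) < threshold


# Hypothetical scores for a clean input: premise and conclusion both hold.
clean = {"red": 0.9, "octagonal": 0.8, "stop_sign": 0.95}
# Perturbed input: concepts still detected, but the class output has flipped.
attacked = {"red": 0.9, "octagonal": 0.8, "stop_sign": 0.05}
# Perturbed input where the premise concept "red" has vanished as well.
premise_gone = {"red": 0.1, "octagonal": 0.8, "stop_sign": 0.05}

for name, scores in [("clean", clean), ("attacked", attacked), ("premise_gone", premise_gone)]:
    print(name, round(property_truth(scores), 3), is_falsified(scores))
```

Under this toy semantics, only the second input violates the rule: the output flips while the premise concepts remain active. The third input also flips the output, but because the premise no longer holds, the implication stays (vacuously) true and is not counted as a property violation.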
