Reactive Model Correction: Mitigating Harm to Task-Relevant Features via Conditional Bias Suppression

Dilyara Bareeva, Maximilian Dreyer, Frederik Pahde, Wojciech Samek, Sebastian Lapuschkin

TL;DR

This work tackles spurious correlations in deep networks with a reactive, conditionally triggered framework for post-hoc bias suppression. The Reactive ClArC (R-ClArC) method pairs a condition-generating function with a backward-artifact projection in latent space, so that artifact directions are suppressed only when they influence a given prediction, which reduces collateral damage to task-relevant features. In controlled FunnyBirds experiments and on real-world ISIC2019 data, reactive variants preserve or improve clean-sample performance and reduce artifact relevance, outperforming non-reactive baselines in many settings. The approach offers a practical path to safer deployment of models in high-stakes applications: corrections are applied only when they are needed, and entanglement between artifact and legitimate features is taken into account.

Abstract

Deep Neural Networks are prone to learning and relying on spurious correlations in the training data, which can have fatal consequences in high-risk applications. Various approaches that suppress model reliance on harmful features have been proposed, and they can be applied post-hoc without additional training. Although these methods are efficient to apply, they also tend to harm model performance by globally shifting the distribution of latent features. To mitigate unintended overcorrection of model behavior, we propose a reactive approach conditioned on model-derived knowledge and eXplainable Artificial Intelligence (XAI) insights. While the reactive approach can be applied to many post-hoc methods, we demonstrate the incorporation of reactivity in particular for P-ClArC (Projective Class Artifact Compensation), introducing a new method called R-ClArC (Reactive Class Artifact Compensation). Through rigorous experiments in controlled settings (FunnyBirds) and with a real-world dataset (ISIC2019), we show that introducing reactivity can minimize the detrimental effect of the applied correction while simultaneously ensuring low reliance on spurious features.
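
The idea of conditional suppression described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one common form of projective suppression, in which the latent activation is moved along a unit artifact direction v until it matches a reference activation z on that direction, and it assumes a simple condition that combines a predicted-class check with a thresholded artifact activation. All function names, the threshold, and the toy vectors are hypothetical.

```python
import numpy as np

def project_out(a, v, z):
    """Shift activation `a` along direction `v` so that its component
    on `v` equals that of the reference activation `z` (assumed form)."""
    v = v / np.linalg.norm(v)
    return a - np.dot(a - z, v) * v

def class_condition(pred_class, target_classes):
    """Hypothetical class condition: the prediction is one of the
    classes known to be affected by the artifact."""
    return pred_class in target_classes

def artifact_condition(a, v, threshold):
    """Hypothetical artifact condition: the activation along the
    artifact direction exceeds a chosen threshold."""
    v = v / np.linalg.norm(v)
    return float(np.dot(a, v)) > threshold

def reactive_correct(a, pred_class, v, z, target_classes, threshold):
    """Apply the projection only when both conditions hold;
    otherwise return the activation unchanged."""
    if class_condition(pred_class, target_classes) and artifact_condition(a, v, threshold):
        return project_out(a, v, z)
    return a

# Toy usage: a 2-D latent space whose first axis is the artifact direction.
v = np.array([2.0, 0.0])   # unnormalized artifact direction
z = np.zeros(2)            # reference (e.g. clean-sample mean) activation
a = np.array([3.0, 1.0])   # activation with a strong artifact component

corrected = reactive_correct(a, pred_class=1, v=v, z=z, target_classes={1}, threshold=2.0)
untouched = reactive_correct(a, pred_class=0, v=v, z=z, target_classes={1}, threshold=2.0)
```

In this toy setting the sample predicted as class 1 loses its component along the artifact direction, while the same activation predicted as class 0 passes through unmodified. That mirrors the zebra example, where stripe features are left intact for a class that legitimately relies on them.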

Paper Structure (23 sections, 10 equations, 14 figures, 8 tables, 1 algorithm)

Figures (14)

  • Figure 1: Reactive Model Correction: Whereas traditional post-hoc model correction approaches are applied to all samples uniformly, we propose conditional suppression of artifacts. One possible condition for triggering correction is the combination of a specific class prediction and the presence of a spurious feature (left). This prevents the suppression of concepts when unnecessary or even harmful: When correcting, e.g., for a "hurdle"-artifact (related to "stripe" features), we refrain from suppression for zebra samples, as stripe textures are now valid discriminative features and crucial to discerning zebras from horses (right).
  • Figure 2: Examples of adding artifact concepts in a controlled manner: (Left): For FunnyBirds, we insert a "green box" into images. (Right): For ISIC, we insert "reflections" on the side.
  • Figure 3: Histogram of activations for the FunnyBirds backdoor ("green box") artifact in ResNet18: the backdoor aligns with features specific to class 1.
  • Figure 4: Histogram of activations for the ISIC "reflection" artifact in ResNet18: outlier clean samples with white spots lead to high concept activation.
  • Figure 5: Cosine similarity between the artifact direction and the mean feature direction of each class for the ISIC dataset and ResNet18: the artifact concept representations are entangled with clean features.
  • ...and 9 more figures
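
The entanglement measure named in the Figure 5 caption, the cosine similarity between an artifact direction and each class's mean feature vector, can be sketched as follows. This is an illustrative sketch only: the function name, the feature matrix layout (one row per sample), and the toy data are assumptions, and it presumes that no class mean is the zero vector.

```python
import numpy as np

def class_cosine_similarities(features, labels, v):
    """Cosine similarity between direction `v` and the mean feature
    vector of each class. `features` has shape (n_samples, n_dims)."""
    v = v / np.linalg.norm(v)
    sims = {}
    for c in np.unique(labels):
        mean_c = features[labels == c].mean(axis=0)
        # Assumes mean_c is non-zero; otherwise the ratio is undefined.
        sims[int(c)] = float(np.dot(mean_c, v) / np.linalg.norm(mean_c))
    return sims

# Toy data: class 0 lies along the artifact axis, class 1 is orthogonal to it.
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
v = np.array([1.0, 0.0])

sims = class_cosine_similarities(features, labels, v)
```

A similarity near 1 for a class would suggest that removing the artifact direction also removes much of what characterizes that class, which is the kind of entanglement the caption describes.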

Theorems & Definitions (6)

  • Definition 3.1: Condition-generating function
  • Definition 3.2
  • Definition 3.3: Multi-Artifact
  • Definition 3.4: R-ClArC
  • Definition 3.5: Class-condition-generating function
  • Definition 3.6: Artifact-condition-generating function