Reactive Model Correction: Mitigating Harm to Task-Relevant Features via Conditional Bias Suppression
Dilyara Bareeva, Maximilian Dreyer, Frederik Pahde, Wojciech Samek, Sebastian Lapuschkin
TL;DR
This work tackles the problem of spurious correlations in deep networks by introducing a reactive, conditionally triggered post-hoc bias suppression framework. The Reactive ClArC method integrates a condition-generating function with a backward-artifact projection in latent space to suppress artifact directions only when they influence a given prediction, reducing collateral damage to task-relevant features. Across controlled FunnyBirds experiments and real-world ISIC2019 data, reactive variants preserve or improve clean-sample performance and reduce artifact relevance, outperforming non-reactive baselines in many settings. The approach offers a practical path to safer deployment of high-stakes models by localizing corrections to when they are truly needed and by accounting for entanglement between artifact and legitimate features.
Abstract
Deep Neural Networks are prone to learning and relying on spurious correlations in the training data, which, for high-risk applications, can have fatal consequences. Various approaches for suppressing model reliance on harmful features have been proposed that can be applied post-hoc without additional training. Although these methods are efficient to apply, they tend to harm model performance by globally shifting the distribution of latent features. To mitigate this unintended overcorrection of model behavior, we propose a reactive approach conditioned on model-derived knowledge and eXplainable Artificial Intelligence (XAI) insights. While the reactive approach can be applied to many post-hoc methods, we demonstrate the incorporation of reactivity for P-ClArC (Projective Class Artifact Compensation) in particular, introducing a new method called R-ClArC (Reactive Class Artifact Compensation). Through rigorous experiments in controlled settings (FunnyBirds) and with a real-world dataset (ISIC2019), we show that introducing reactivity can minimize the detrimental effect of the applied correction while simultaneously ensuring low reliance on spurious features.
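The idea of conditional suppression can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the artifact is represented by a single direction `v` in latent space, that the projective correction moves an activation along `v` back to a reference point `z_clean` (for example, the mean activation of artifact-free samples), and that the condition is a simple threshold on the activation's component along `v`. The function names and the threshold rule are hypothetical; the abstract only states that the condition is derived from model knowledge and XAI insights.

```python
import numpy as np


def project_out_artifact(a, v, z_clean):
    """Projective correction (assumed form): remove the component of
    (a - z_clean) that lies along the artifact direction v."""
    v_unit = v / np.linalg.norm(v)
    return a - np.dot(a - z_clean, v_unit) * v_unit


def artifact_condition(a, v, z_clean, threshold):
    """Hypothetical condition: the artifact counts as present when the
    activation's offset along v exceeds a threshold."""
    v_unit = v / np.linalg.norm(v)
    return float(np.dot(a - z_clean, v_unit)) > threshold


def reactive_correction(a, v, z_clean, threshold):
    """Apply the projection only when the condition fires; otherwise
    return the activation unchanged."""
    if artifact_condition(a, v, z_clean, threshold):
        return project_out_artifact(a, v, z_clean)
    return a.copy()


# Toy 2-D latent space: the artifact direction is the first axis.
v = np.array([2.0, 0.0])
z_clean = np.array([0.5, 0.0])

with_artifact = np.array([3.0, 1.0])     # large component along v
without_artifact = np.array([0.7, 1.0])  # close to the clean reference

corrected = reactive_correction(with_artifact, v, z_clean, threshold=1.0)
untouched = reactive_correction(without_artifact, v, z_clean, threshold=1.0)
```

In this toy setting, the first activation is moved back to the clean reference along `v` while its orthogonal component is preserved, and the second is left unchanged. A non-reactive projection would shift both, which is the global distribution shift the abstract describes.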
