On Measuring Localization of Shortcuts in Deep Networks
Nikita Tsoy, Nikola Konstantinov
TL;DR
This work tackles the challenge of understanding how shortcuts emerge in deep networks by developing a layer-wise, counterfactual framework that quantifies each block's contribution to accuracy degradation under shortcut-inducing skews. It decomposes shortcut learning into spurious feature encoding and core feature forgetting, using two metrics to localize these effects across network blocks in multiple architectures and datasets. The key finding is that shortcuts are distributed across layers rather than localized to a single block, with early layers tending to encode spurious cues and later layers forgetting core, predictive features; dataset, architecture, skews, and optimizer all shape these patterns. Importantly, the authors demonstrate that their localization metrics can predict the success of certain layer-wise interventions (e.g., learning-rate adjustments or freezing), implying that effective shortcut mitigation should be tailored to the specific dataset and architecture, rather than relying on universal strategies.
Abstract
Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design uses counterfactual training on clean and skewed datasets to quantify each layer's contribution to the accuracy degradation caused by a shortcut-inducing skew. We employ our design to study shortcuts on the CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests that designing general methods is hard, supporting dataset- and architecture-specific approaches instead.
