ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
TL;DR
This work questions the longstanding claim that ImageNet-trained CNNs are texture-biased, identifying limitations in cue-conflict experiments and proposing a domain-agnostic suppression framework to quantify reliance on shape, texture, and color. By systematically suppressing individual features, the authors show that CNNs primarily rely on local shape, and that this dependence can be mitigated with modern training regimes and architectures, with vision-language supervision (CLIP-ViT) yielding closer alignment to human patterns. Extending the analysis across computer vision, medical imaging, and remote sensing reveals domain-specific reliance hierarchies: shape dominates in computer vision, color in medical imaging, and texture in remote sensing, suggesting that feature reliance is flexible and data-driven rather than fixed. The framework and cross-domain findings offer practical guidance for designing models that better reflect human perceptual strategies and robustness across tasks, while highlighting the ongoing need to disentangle architectural biases from training dynamics. Code is available at https://github.com/tomburgert/feature-reliance.
Abstract
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
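To make the idea of suppression-based evaluation concrete, the following is a minimal sketch and not the authors' implementation. It assumes three illustrative transformations that the abstract does not specify: grayscale conversion for color, a box blur for texture, and patch shuffling for shape. It also assumes reliance is scored as the relative accuracy drop under suppression. All function names and parameters (`suppress_color`, `suppress_texture`, `suppress_shape`, `relative_drop`, `k`, `grid`) are hypothetical; the linked repository is the reference for the actual method.

```python
import numpy as np

def suppress_color(img):
    """Replace RGB with luminance copied to all channels (assumed color suppression)."""
    gray = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(gray[..., None], 3, axis=-1)

def suppress_texture(img, k=5):
    """Box blur with a k x k window (a crude, assumed stand-in for texture smoothing)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def suppress_shape(img, grid=4, seed=0):
    """Shuffle a grid x grid tiling of patches (assumed shape suppression)."""
    h, w, c = img.shape
    ph, pw = h // grid, w // grid
    img = img[:ph * grid, :pw * grid]  # crop so the image tiles evenly
    patches = (img.reshape(grid, ph, grid, pw, c)
                  .swapaxes(1, 2)
                  .reshape(grid * grid, ph, pw, c))
    rng = np.random.default_rng(seed)
    patches = patches[rng.permutation(grid * grid)]
    return (patches.reshape(grid, grid, ph, pw, c)
                   .swapaxes(1, 2)
                   .reshape(grid * ph, grid * pw, c))

def relative_drop(acc_clean, acc_suppressed):
    """Fraction of clean accuracy lost under suppression (assumed reliance score)."""
    return (acc_clean - acc_suppressed) / acc_clean
```

Under these assumptions, a model would be evaluated once on clean images and once per suppressed variant, and the feature whose suppression yields the largest `relative_drop` would be read as the one the model relies on most. Because each feature is removed in isolation, no forced choice between conflicting cues is involved.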
