ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir

TL;DR

The work questions the longstanding claim that ImageNet-trained CNNs are texture-biased. It identifies limitations in cue-conflict experiments and proposes a domain-agnostic suppression framework that quantifies reliance on shape, texture, and color. By systematically suppressing individual features, the authors show that CNNs primarily rely on local shape, and that this dependence can be mitigated with modern training regimes and architectures, with vision-language supervision (CLIP-ViT) yielding closer alignment to human patterns. Extending the analysis across computer vision (CV), medical imaging (MI), and remote sensing (RS) reveals domain-specific reliance hierarchies: shape dominates CV, color dominates MI, and texture dominates RS, suggesting that feature reliance is flexible and data-driven rather than fixed. The framework and cross-domain findings offer practical guidance for designing models that better reflect human perceptual strategies and are more robust across tasks, while highlighting the ongoing need to disentangle architectural biases from training dynamics. Code is available at https://github.com/tomburgert/feature-reliance.

Abstract

The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
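The suppression idea in the abstract can be illustrated with a small sketch: evaluate a classifier on clean images, evaluate it again after one cue has been suppressed, and compare the two accuracies. Everything below is an assumption for illustration and is not taken from the authors' repository: the function names `accuracy` and `relative_accuracy`, the `predict` and `suppress` callables, and the choice to report the ratio of suppressed to clean accuracy.

```python
from typing import Any, Callable, Sequence


def accuracy(predict: Callable[[Any], int],
             images: Sequence[Any],
             labels: Sequence[int]) -> float:
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(1 for img, y in zip(images, labels) if predict(img) == y)
    return correct / len(labels)


def relative_accuracy(predict: Callable[[Any], int],
                      suppress: Callable[[Any], Any],
                      images: Sequence[Any],
                      labels: Sequence[int]) -> float:
    """Accuracy on suppressed images divided by accuracy on clean images.

    Assumption: "relative accuracy" is taken here to be this ratio; the
    paper may normalize differently.
    """
    clean = accuracy(predict, images, labels)
    suppressed = accuracy(predict, [suppress(img) for img in images], labels)
    return suppressed / clean if clean > 0 else float("nan")


# Toy stand-ins: "images" are integers, the "model" predicts parity,
# and the "suppression" destroys the cue the model depends on.
toy_images = [1, 2, 3, 4]
toy_labels = [1, 0, 1, 0]
toy_predict = lambda x: x % 2
toy_suppress = lambda x: 0

print(relative_accuracy(toy_predict, toy_suppress, toy_images, toy_labels))
```

Under this reading, a value near 1 means the model barely depends on the suppressed cue, while a value well below 1 indicates strong reliance on it.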

Paper Structure

This paper contains 32 sections, 11 equations, 19 figures, 12 tables.

Figures (19)

  • Figure 1: Comparison of the cue-conflict setup (Geirhos et al., 2019) (left) and our suppression-based framework (right). While Geirhos et al. infer reliance through preference on hybrid images, our framework directly quantifies reliance by measuring accuracy under systematic suppression of texture, shape, or color.
  • Figure 2: Example images taken from the cue-conflict dataset (Geirhos et al., 2019). (a) Boat shape cue merged with chair texture cue. (b) Airplane shape cue merged with clock texture cue. (c) Icons of the human interface to select classes.
  • Figure 3: Relative accuracy under feature suppression for human observers and three models (ResNet50-standard, ResNet50-sota, ConvNeXtV2) on the curated ImageNet16 dataset. Each subplot shows performance under suppression of a specific feature: (a) global shape via Patch Shuffle (grid=3); (b) local shape via Patch Shuffle (grid=6); (c) texture via bilateral filtering; and (d) color via grayscale.
  • Figure 4: Feature suppression results across three domains. Top row (a–c): ResNet50 pretrained on ImageNet and fine-tuned on datasets. Middle row (d–f): ResNet50 trained from scratch on MI datasets from MedMNIST-v2. Bottom row (g–i): ResNet50 trained from scratch on high-resolution datasets. Columns correspond to: (a, d, g) shape suppression (Patch Shuffle), (b, e, h) texture suppression (Bilateral Filter), and (c, f, i) color suppression (Grayscale).
  • Figure 5: Domain-averaged feature suppression curves for CV, MI, and RS. (a) Shape suppression via Patch Shuffle. (b) Texture suppression via bilateral filtering. (c) Color suppression via grayscale.
  • ...and 14 more figures
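
Two of the suppression operators named in the captions above, Patch Shuffle and grayscale conversion, can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: images are nested lists of pixels, the function names `patch_shuffle` and `to_grayscale` are hypothetical, the grayscale weights are the common ITU-R BT.601 luma coefficients, and the image sides must be divisible by the grid size. Bilateral filtering is omitted because it needs an image-processing library.

```python
import random


def patch_shuffle(image, grid, rng):
    """Split an image into grid x grid patches and permute them randomly.

    image: list of rows, each row a list of pixels (any pixel type).
    Assumption: height and width are divisible by `grid`.
    """
    h, w = len(image), len(image[0])
    if h % grid or w % grid:
        raise ValueError("image sides must be divisible by grid")
    ph, pw = h // grid, w // grid
    # Cut the patches out in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(patches)
    # Reassemble: each band of `grid` patches forms `ph` output rows.
    out = []
    for r in range(grid):
        band = patches[r * grid:(r + 1) * grid]
        for i in range(ph):
            out.append([px for patch in band for px in patch[i]])
    return out


def to_grayscale(image):
    """Replace each (R, G, B) pixel by its luma, repeated on all channels."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


# A 6x6 toy image whose pixels are distinct integers.
toy = [[r * 6 + c for c in range(6)] for r in range(6)]
shuffled = patch_shuffle(toy, grid=3, rng=random.Random(0))
gray = to_grayscale([[(255, 255, 255), (0, 0, 0)]])
```

In this sketch a larger `grid` yields smaller patches, which matches the Figure 3 caption: grid=3 is used there for global shape suppression and grid=6 for local shape suppression. Pixel values inside each patch are left untouched, and only the spatial arrangement of patches changes.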