Does resistance to style-transfer equal Global Shape Bias? Measuring network sensitivity to global shape configuration

Ziqi Wen, Tianqin Li, Zhi Jing, Tai Sing Lee

TL;DR

This work critiques the idea that resilience to style transfer equates to global shape bias by introducing the Disrupted Structure Testbench (DiST), a direct odd-man-out measure of global structure sensitivity using texture-preserving, globally disrupted variants. It shows that style augmentation can improve cue-conflict performance without guaranteeing global-structure understanding, while self-supervised training (notably MAE) enhances global-structure sensitivity in Vision Transformers. The authors demonstrate that DiST performance and style-transfer resistance are orthogonal and complementary, with DiSTinguish providing a practical training approach to emphasize global structure and MAE-based SSL yielding strong DiST performance. The findings highlight the need to combine multiple evaluation and training strategies to promote both true global shape understanding and robust local-feature representations in modern vision models.

Abstract

Deep learning models are known to exhibit a strong texture bias, while humans tend to rely heavily on global shape structure for object recognition. The current benchmark for evaluating a model's global shape bias is a set of style-transferred images, on the assumption that resistance to the attack of style transfer is related to the development of global structure sensitivity in the model. In this work, we show that networks trained with style-transfer images indeed learn to ignore style, but their shape bias arises primarily from local detail. We provide the Disrupted Structure Testbench (DiST) as a direct measurement of global structure sensitivity. Our test includes 2400 original images from ImageNet-1K, each accompanied by two images in which the global shape of the original is disrupted while its texture is preserved via a texture synthesis program. We found that (1) models that performed well on the previous cue-conflict dataset do not fare well on the proposed DiST; (2) the supervised Vision Transformer (ViT) loses the global spatial information from its positional embedding, leading to no significant advantage over Convolutional Neural Networks (CNNs) on DiST, while self-supervised learning methods, especially the masked autoencoder (MAE), significantly improve the global structure sensitivity of ViT; (3) improving global structure sensitivity is orthogonal to resistance to style transfer, indicating that the relationship between global shape structure and local texture detail is not an either/or relationship. Training with DiST images and training with style-transferred images are complementary and can be combined to enhance both the global shape sensitivity and the robustness of local features. Our code will be hosted on GitHub: https://github.com/leelabcnbc/DiST
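
The odd-man-out measurement described above can be illustrated with a minimal sketch. It assumes that each trial consists of one original image and its two structure-disrupted variants, that the original is the correct odd one out, and that the model's choice is read off from cosine similarity between embeddings. The function names (`cosine`, `odd_man_out`, `dist_accuracy`) and the similarity rule are hypothetical and are not taken from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def odd_man_out(embeddings):
    """Return the index of the embedding least similar, on average, to the others.

    Assumption: the model's "choice" is the image whose embedding has the
    lowest mean cosine similarity to the remaining images in the trial.
    """
    scores = []
    for i, e in enumerate(embeddings):
        others = [cosine(e, o) for j, o in enumerate(embeddings) if j != i]
        scores.append(sum(others) / len(others))
    return min(range(len(scores)), key=scores.__getitem__)


def dist_accuracy(trials):
    """Fraction of trials in which the chosen odd image is the original.

    trials: list of (embeddings, index_of_original) pairs.
    """
    correct = sum(odd_man_out(embs) == target for embs, target in trials)
    return correct / len(trials)


# Toy 2-D embeddings: the first vector stands in for the original image,
# the other two for its structure-disrupted variants.
trial = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
print(odd_man_out(trial))
```

Under this rule, a model that encodes only texture would map all three images to nearly identical embeddings, so its choice would be close to arbitrary; this is the failure mode the testbench is meant to expose.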

Paper Structure (24 sections, 2 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Left: Feature attribution analysis based on SmoothGrad (Smilkov et al., 2017) for models trained with stylized augmentation. Surprisingly, models that can resist style transfer are still primarily sensitive to local features rather than to the global shape configuration. Right: Illustration of our proposed Disrupted Structure Testbench (DiST). We hope that a machine can distinguish images with disrupted global structure from the original image, in line with humans, who use global shape structure as a cue for object recognition.
  • Figure 2: Disrupted Structure Testbench (DiST)
  • Figure 3: Mechanism for computing globally structure-disrupted images. We implement the approach proposed by Gatys et al. (2015). Specifically, we optimize a randomly initialized image (yellow) so that, when it passes through a pretrained VGG network, the Gram matrices of its intermediate layers match those of the target image (blue). This preserves the image's local features but randomizes its global structure.
  • Figure 4: DiSTinguish training: structure-disrupted images are added as separate classes.
  • Figure 5: Performance of humans and different models on the DiST and cue-conflict datasets.
  • ...and 7 more figures
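
The Gram-matrix matching mentioned in the Figure 3 caption can be sketched in a few lines. The example below is only an illustration of why matching Gram matrices preserves local feature statistics while leaving global arrangement unconstrained: it uses tiny hand-written feature maps instead of VGG activations, omits the image optimization loop entirely, and the normalization by the number of spatial positions is an assumption. The helper names (`gram_matrix`, `gram_loss`, `permute_positions`) are hypothetical.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a flat list of N spatial activations.
    Entry (i, j) is the mean over positions of channel_i * channel_j
    (dividing by N is an assumed normalization).
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def gram_loss(feats_a, feats_b):
    """Sum of squared differences between two Gram matrices."""
    ga, gb = gram_matrix(feats_a), gram_matrix(feats_b)
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))


def permute_positions(features, order):
    """Rearrange the spatial positions of every channel in the same way."""
    return [[channel[k] for k in order] for channel in features]


# Two channels over four spatial positions (toy values, not VGG activations).
features = [[1, 2, 3, 4], [0, 1, 0, 1]]
shuffled = permute_positions(features, [2, 0, 3, 1])

print(gram_matrix(features))          # channel co-occurrence statistics
print(gram_loss(features, shuffled))  # 0.0: spatial layout does not affect the Gram matrix
```

Because the Gram matrix sums over spatial positions, any rearrangement of those positions leaves it unchanged. An image optimized only to match these statistics is therefore free to place local features anywhere, which is consistent with the caption's description of preserved local features and randomized global structure.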