Does resistance to style-transfer equal Global Shape Bias? Measuring network sensitivity to global shape configuration
Ziqi Wen, Tianqin Li, Zhi Jing, Tai Sing Lee
TL;DR
This work challenges the assumption that resistance to style transfer equates to global shape bias. It introduces the Disrupted Structure Testbench (DiST), a direct odd-man-out measure of global structure sensitivity built from texture-preserving, globally disrupted image variants. Style augmentation can improve cue-conflict performance without guaranteeing global-structure understanding, while self-supervised training (notably MAE) enhances global-structure sensitivity in Vision Transformers. The authors show that DiST performance and style-transfer resistance are orthogonal and complementary: DiSTinguish offers a practical training approach that emphasizes global structure, and MAE-based self-supervised learning yields strong DiST performance. The findings suggest that multiple evaluation and training strategies should be combined to promote both global shape understanding and robust local-feature representations in modern vision models.
Abstract
Deep learning models are known to exhibit a strong texture bias, while humans tend to rely heavily on global shape structure for object recognition. The current benchmark for evaluating a model's global shape bias is a set of style-transferred images, under the assumption that resistance to style-transfer attacks reflects the development of global structure sensitivity in the model. In this work, we show that networks trained with style-transferred images indeed learn to ignore style, but their shape bias arises primarily from local detail. We provide the Disrupted Structure Testbench (DiST) as a direct measurement of global structure sensitivity. Our test includes 2400 original images from ImageNet-1K, each accompanied by two images in which the global shape of the original is disrupted while its texture is preserved via a texture synthesis program. We found that (1) models that perform well on the previous cue-conflict dataset do not fare well on the proposed DiST; (2) supervised Vision Transformers (ViTs) lose the global spatial information provided by positional embeddings, leaving them with no significant advantage over Convolutional Neural Networks (CNNs) on DiST, whereas self-supervised learning methods, especially masked autoencoders (MAE), significantly improve the global structure sensitivity of ViTs; (3) improving global structure sensitivity is orthogonal to resistance to style transfer, indicating that the relationship between global shape structure and local texture detail is not an either/or relationship. Training with DiST images and training with style-transferred images are complementary, and the two can be combined to enhance both the global shape sensitivity of a network and the robustness of its local features. Our code will be hosted on GitHub: https://github.com/leelabcnbc/DiST
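To make the odd-man-out idea concrete, the following is a minimal sketch of how such a trial could be scored from model embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`odd_one_out`, `odd_one_out_accuracy`), the use of cosine similarity, and the convention that the intact original is the expected odd item among its two shape-disrupted variants are all hypothetical choices made here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def odd_one_out(embeddings: Sequence[Vector]) -> int:
    """Return the index of the embedding least similar to the others.

    Assumption: the "odd" item is the one with the lowest summed cosine
    similarity to the remaining items in the trial.
    """
    totals: List[float] = [
        sum(cosine(e, other) for j, other in enumerate(embeddings) if j != i)
        for i, e in enumerate(embeddings)
    ]
    return min(range(len(totals)), key=totals.__getitem__)


def odd_one_out_accuracy(
    trials: Sequence[Sequence[Vector]], answers: Sequence[int]
) -> float:
    """Fraction of trials where the predicted odd item matches the answer."""
    hits = sum(odd_one_out(t) == a for t, a in zip(trials, answers))
    return hits / len(answers)


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical): one intact original and two
    # shape-disrupted variants that land close to each other.
    trial_a = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]  # original at index 0
    trial_b = [[0.0, 1.0], [1.0, 0.0], [0.1, 1.0]]  # original at index 1
    print(odd_one_out_accuracy([trial_a, trial_b], [0, 1]))
```

In this sketch, a model that encodes global structure would place the two disrupted variants near each other and away from the original, so the original is picked as the odd item. A model that relies mainly on texture would embed all three images similarly, and its choice would approach chance.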
