Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez

TL;DR

The paper introduces the Configural Shape Score (CSS), an absolute measure of holistic configural shape processing, and evaluates 86 vision models on an Object-Anagram dataset. CSS reveals a broad spectrum of configural sensitivity: self-supervised Vision Transformers and language-aligned models score highest, and long-range interactions prove crucial for configural processing. Mechanistic probes (attention ablations, relational positional encodings, and representational analyses) identify mid-depth layers as the locus of configural integration, while a BagNet control, which relies on local cues, remains at chance. CSS also predicts performance on other shape-dependent evaluations, suggesting that the path toward robust, human-like vision lies in integrating local texture with global configural cues. The work offers methodological and architectural guidance for designing vision systems that combine the two.

Abstract

Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers -- exemplified by DINOv2, SigLIP2 and EVA-CLIP -- occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control remains at chance (iv), ruling out "border-hacking" strategies. Finally, (v) we show that configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.
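The abstract describes CSS as the ability to recognize both images in an Object-Anagram pair. A minimal sketch of such a pair-level score follows, assuming a pair counts as correct only when both of its images are classified correctly; the exact scoring rule, the function name `configural_shape_score`, and the toy labels are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple

# One anagram pair: (prediction, label) for image A and for image B.
Pair = Tuple[Tuple[str, str], Tuple[str, str]]


def configural_shape_score(pairs: Sequence[Pair]) -> float:
    """Fraction of pairs in which BOTH images are classified correctly.

    Assumption: a pair scores 1 only if both predictions match their
    labels, so a model that always answers with the shared texture's
    dominant category gets no credit for that pair.
    """
    if not pairs:
        raise ValueError("need at least one anagram pair")
    both_correct = sum(
        1 for (pred_a, label_a), (pred_b, label_b) in pairs
        if pred_a == label_a and pred_b == label_b
    )
    return both_correct / len(pairs)


# Hypothetical predictions for four anagram pairs.
example_pairs = [
    (("wolf", "wolf"), ("elephant", "elephant")),  # both correct
    (("wolf", "wolf"), ("wolf", "elephant")),      # second image missed
    (("cat", "cat"), ("truck", "truck")),          # both correct
    (("dog", "bird"), ("dog", "chair")),           # both missed
]
score = configural_shape_score(example_pairs)
```

Scoring at the pair level rather than per image is what makes the measure sensitive to configuration: the two images share the same patches, so local texture alone cannot separate them.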

Paper Structure

This paper contains 26 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Object-Anagram task: a probe of configural shape perception. (A) Visual-anagram example: an identical set of 16 square diffusion patches is spatially permuted to form two distinct objects, here a wolf and an elephant (one shared patch is outlined in red). (B) Additional image pairs from the object-anagram benchmark. Each pair comprises globally different objects built from the same unordered patch multiset, forcing any successful classifier to rely solely on the global arrangement of parts.
  • Figure 2: Configural Shape Score (CSS) reveals variation across vision models matched in recognition performance and dissociates from ImageNet accuracy and shape-vs-texture bias. (A) CSS across 86 vision models, quantifying how accurately models recognize the distinct objects in each anagram pair. Human performance is shown as the dashed reference line. (B) Relationship between CSS and top-1 ImageNet accuracy across all models. (C) CSS compared to shape-vs-texture bias for models trained with stylization, adversarial robustness, and Top-K sparsity. While these methods increase shape-vs-texture bias, they show modest-to-no gains in CSS. (D) Relationship between CSS and shape-vs-texture bias across all models.
  • Figure 3: Long-range contextual interactions lead to a higher Configural Shape Score. (A) Ablating self-attention in DINOv2-B/14 by selectively restricting each patch to attend only inside (blue) or outside (orange) a local window. Ablations are applied over windows with 1 or 2 nearby patches. (B) Effect of attentional ablation on the class token representation and Configural Shape Score for a high-CSS model (DINOv2-B/14). Restricting attention to short-range interactions ("attend inside" condition, blue line) changes class tokens and disrupts CSS, most strongly at intermediate blocks. This effect is minimal when restricting attention to long-range interactions ("attend outside" condition, orange line). The dashed line shows CSS in the unablated condition. (C) Effect of attentional ablation on the class token representation and Configural Shape Score for a low-CSS model (ViT-B/16). The disruption caused by restricting attention to short-range interactions is reduced in this model.
  • Figure 4: (A) Control pairs to tease apart category-level and component-level influence in model representations. (B) Cosine similarity across layers for each control pair type in EVA-CLIP G/14 and ResNet50. (C) Quantifying the influence of object category vs. puzzle component from final-layer embeddings. Models with a higher Configural Shape Score (CSS) show stronger category influence and weaker component influence.
  • Figure 5: Configural Shape Score (CSS) predicts model performance across a range of benchmarks. CSS is positively correlated with foreground-vs-background bias, robustness to noise, phase dependence, and critical band masking bandwidth.
  • ...and 6 more figures
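The "attend inside" and "attend outside" ablations described in the Figure 3 caption can be sketched as boolean masks over a patch grid. The following is a rough illustration only: the Chebyshev (square-window) distance, the choice to let each patch always attend to itself in the "outside" condition, the omission of the class token, and the function name `radius_attention_mask` are all assumptions, not details confirmed by the paper.

```python
from typing import List


def radius_attention_mask(grid_h: int, grid_w: int, radius: int,
                          mode: str = "inside") -> List[List[bool]]:
    """Return an (N x N) boolean mask, N = grid_h * grid_w.

    mask[i][j] is True when query patch i may attend to key patch j.
    "inside":  only patches within `radius` grid steps (square window).
    "outside": only patches farther than `radius`, plus the patch
               itself (assumption: avoids rows with no allowed key).
    """
    if mode not in ("inside", "outside"):
        raise ValueError("mode must be 'inside' or 'outside'")
    n = grid_h * grid_w
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, grid_w)
        for j in range(n):
            row_j, col_j = divmod(j, grid_w)
            # Chebyshev distance gives a square local window.
            near = max(abs(row_i - row_j), abs(col_i - col_j)) <= radius
            if mode == "inside":
                mask[i][j] = near
            else:
                mask[i][j] = (not near) or i == j
    return mask


# Toy 4x4 patch grid with a window of 1 nearby patch.
inside = radius_attention_mask(4, 4, radius=1, mode="inside")
outside = radius_attention_mask(4, 4, radius=1, mode="outside")
```

In a real transformer, a mask like this would typically be turned into an additive bias (0 where allowed, negative infinity where blocked) applied to the attention logits before the softmax, in the chosen blocks only.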