Configural processing as an optimized strategy for robust object recognition in neural networks

Hojin Jang, Pawan Sinha, Xavier Boix

TL;DR

This work shows that configural processing, i.e., reliance on the spatial relationships among an object's parts, can emerge in naïve neural networks from task contingencies alone and supports object recognition that is robust to geometric transformations. Using controlled EMNIST letter composites and face stimuli, the authors show that networks learn configural cues, that these cues are more robust to rotation and scaling than local features, and that networks favor them when both cue types are available. Layerwise analysis reveals a shift from local-feature sensitivity in earlier layers to configural sensitivity in later layers, and shows that configural processing can arise in a purely feedforward architecture. Architectural choices (e.g., Vision Transformers) and training losses modulate the reliance on configural cues, with implications for designing robust vision systems and for understanding holistic processing of faces.

Abstract

Configural processing, the perception of spatial relationships among an object's components, is crucial for object recognition. However, the teleology and underlying neurocomputational mechanisms of such processing are still elusive, notwithstanding decades of research. We hypothesized that processing objects via configural cues provides a more robust means of recognizing them than local featural cues. We evaluated this hypothesis by devising identification tasks with composite letter stimuli and comparing different neural network models trained with either only local or only configural cues available. We found that configural cues yielded performance that was more robust to geometric transformations such as rotation or scaling. Furthermore, when both cue types were simultaneously available, configural cues were favored over local featural cues. Layerwise analysis revealed that sensitivity to configural cues emerged in later layers than sensitivity to local featural cues, possibly contributing to the robustness to pixel-level transformations. Notably, this configural processing occurred in a purely feedforward manner, without the need for recurrent computations. Our findings with letter stimuli were successfully extended to naturalistic face images. Thus, our study provides neurocomputational evidence that configural processing emerges in a naïve network based on task contingencies, and is beneficial for robust object processing under varying viewing conditions.
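
The robustness comparison described in the abstract amounts to measuring how accuracy changes when test images are transformed. The following is a minimal sketch of such an evaluation harness, not the authors' code: the function names (`accuracy_under_transform`, `rotation_sweep`), the toy images, and the toy classifier are all hypothetical. It uses only rotations by multiples of 90 degrees, because `np.rot90` needs no interpolation. Arbitrary angles or scaling would need an image library such as `scipy.ndimage`.

```python
import numpy as np


def accuracy_under_transform(predict, images, labels, transform):
    """Fraction of images still classified correctly after `transform`."""
    preds = np.array([predict(transform(img)) for img in images])
    return float(np.mean(preds == np.asarray(labels)))


def rotation_sweep(predict, images, labels, quarter_turns=(0, 1, 2, 3)):
    """Accuracy at each multiple of 90 degrees (counterclockwise)."""
    return {
        k: accuracy_under_transform(
            predict, images, labels, lambda img, k=k: np.rot90(img, k)
        )
        for k in quarter_turns
    }


# Hypothetical toy data: class 0 has a blob top-left, class 1 bottom-right.
def toy_image(label, size=8):
    img = np.zeros((size, size))
    if label == 0:
        img[:2, :2] = 1.0
    else:
        img[-2:, -2:] = 1.0
    return img


# Hypothetical stand-in for a trained network: a position-dependent readout
# that compares pixel mass in the top-left and bottom-right quadrants.
def toy_predict(img):
    half = img.shape[0] // 2
    return 0 if img[:half, :half].sum() > img[half:, half:].sum() else 1


labels = [0, 1, 0, 1]
images = [toy_image(y) for y in labels]
sweep = rotation_sweep(toy_predict, images, labels)
print(sweep)
```

In this toy setup the readout is perfect on upright images and fails completely after a 180-degree rotation, which illustrates why accuracy has to be swept across transformation levels rather than measured only on the training view. To compare models trained on different cue types, one would pass each trained model's prediction function as `predict`.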

Paper Structure (14 sections, 1 equation, 9 figures)

Figures (9)

  • Figure 1: (A) Conceptual illustrations (left) and actual representations (middle) of visual stimuli for local (top) and configural (bottom) tasks. Depiction of a one-shot four-way classification scenario with targets marked by green squares and distractors by red squares (right). (B) Performance accuracy for local and configural tasks under rotation (top) and scaling (bottom) transformations. The dashed line indicates chance level performance. Different colors represent the number of categories trained. (C) Performance accuracy for local and configural tasks under rotation (top) and scaling (bottom) transformations, with patterns that included novel local features.
  • Figure 1: Sample images from the total 24 classes for different tasks, with local (left), configural (middle), and local plus configural (right) tasks.
  • Figure 2: (A) Illustration of the local plus configural task. (B) Performance accuracy of networks trained on the local plus configural task, when tested on the local plus configural task (left), the local task (middle), and the configural task (right), following the figure conventions described in Fig. 1. (C) Performance accuracy of networks trained on the local task and those trained on the configural task (left and right, respectively), each tested on the local plus configural task.
  • Figure 2: Evaluation of network generalization performance across local and configural tasks, following the figure format used in Fig. 1. Performance was compared under scenarios where local feature elements were consistent (A) and randomly shuffled (B) during both training and evaluation phases.
  • Figure 3: (A) Top-6 images selected by a neuron sensitive to local featural cues (top) and another sensitive to configural cues (bottom). (B) Histograms displaying the sensitivity of individual neurons to local (red) and configural (blue) cues across the layers of ResNet50.
  • ...and 4 more figures