Configural processing as an optimized strategy for robust object recognition in neural networks
Hojin Jang, Pawan Sinha, Xavier Boix
TL;DR
This work demonstrates that configural processing, meaning reliance on the spatial relationships among an object's parts, can emerge in naïve neural networks from task contingencies alone and can yield object recognition that is robust to geometric transformations. Using controlled composite stimuli built from EMNIST letters, as well as face stimuli, the authors show that networks learn configural cues, that these cues are more robust to rotation and scaling than local features, and that networks favor them when both cue types are available. Layerwise analysis reveals a hierarchical shift from local-feature sensitivity in early layers to configural sensitivity in later layers, and indicates that configural processing can arise in a purely feedforward architecture. Architectural choices (e.g., Vision Transformers) and training losses modulate the reliance on configural cues, with implications for designing robust vision systems and for understanding holistic processing of faces.
Abstract
Configural processing, the perception of spatial relationships among an object's components, is crucial for object recognition. However, the teleology and underlying neurocomputational mechanisms of such processing are still elusive, notwithstanding decades of research. We hypothesized that processing objects via configural cues provides a more robust means of recognizing them than processing via local featural cues. We evaluated this hypothesis by devising identification tasks with composite letter stimuli and comparing different neural network models trained with either only local or only configural cues available. We found that configural cues yielded performance that was more robust to geometric transformations such as rotation or scaling. Furthermore, when both cue types were simultaneously available, configural cues were favored over local featural cues. Layerwise analysis revealed that sensitivity to configural cues emerged at later layers than sensitivity to local featural cues, possibly contributing to the robustness to pixel-level transformations. Notably, this configural processing occurred in a purely feedforward manner, without the need for recurrent computations. Our findings with letter stimuli were successfully extended to naturalistic face images. Thus, our study provides neurocomputational evidence that configural processing emerges in a naïve network based on task contingencies, and is beneficial for robust object processing under varying viewing conditions.
