Language learning shapes visual category-selectivity in deep neural networks
Zitong Lu, Yuxin Wang
TL;DR
The study asks whether language experience reshapes visual category representations in artificial systems as it may in the human brain. By comparing a purely visual ResNet, a language-supervised Lang-Learned ResNet, and CLIP using an fMRI-inspired functional localizer, it shows that language learning increases the number of category-selective neurons but reduces single-neuron specificity and spatial localization, and that these effects replicate in CLIP. These results suggest that language drives a semanticization of vision: a shift from local, feature-based codes to distributed, semantically grounded representations. The findings provide a computational bridge between linguistic context and visual cognition, with implications for multimodal learning and brain-inspired AI design.
Abstract
Category-selective regions in the human brain, such as the fusiform face area (FFA), extrastriate body area (EBA), parahippocampal place area (PPA), and visual word form area (VWFA), support high-level visual recognition. Here, we investigate whether artificial neural networks (ANNs) exhibit analogous category-selective neurons and how these representations are shaped by language experience. Using an fMRI-inspired functional localizer approach, we identified face-, body-, place-, and word-selective neurons in deep networks presented with category images and scrambled controls. Both the purely visual ResNet and a linguistically supervised Lang-Learned ResNet contained category-selective neurons that increased in proportion across layers. However, compared to the vision-only model, the Lang-Learned ResNet showed a greater number but lower specificity of category-selective neurons, along with reduced spatial localization and attenuated activation strength, indicating a shift toward more distributed, semantically aligned coding. These effects were replicated in the large-scale vision-language model CLIP. Together, our findings reveal that language experience systematically reorganizes visual category representations in ANNs, providing a computational parallel to how linguistic context may shape categorical organization in the human brain.
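The functional localizer idea described above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the Welch t-statistic, the threshold of 3.0, the function names `welch_t` and `selective_units`, and the synthetic activations are all assumptions made for the example. The abstract states only that units were identified by presenting category images and scrambled controls.

```python
import numpy as np

def welch_t(a, b):
    """Per-unit Welch t-statistic; a and b are (n_images, n_units) arrays."""
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / a.shape[0]
                 + b.var(axis=0, ddof=1) / b.shape[0])
    return mean_diff / (se + 1e-12)  # small constant avoids division by zero

def selective_units(category_acts, scrambled_acts, t_threshold=3.0):
    """Boolean mask of units responding more to category images than controls.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    return welch_t(category_acts, scrambled_acts) > t_threshold

# Synthetic stand-in for one layer's activations (assumed shapes and values).
rng = np.random.default_rng(0)
n_images, n_units = 80, 200
scrambled = rng.normal(0.0, 1.0, (n_images, n_units))
faces = rng.normal(0.0, 1.0, (n_images, n_units))
faces[:, :20] += 2.0  # make the first 20 units respond more strongly to faces

mask = selective_units(faces, scrambled)
proportion = mask.mean()  # fraction of units flagged as face-selective
print(f"face-selective proportion: {proportion:.3f}")
```

Running the same contrast per layer and per category (face, body, place, word) would give the kind of layer-wise proportions the abstract refers to. In practice the activations would come from the network's intermediate layers rather than from a random generator.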
