Language learning shapes visual category-selectivity in deep neural networks

Zitong Lu, Yuxin Wang

TL;DR

The study asks whether language experience reshapes visual category representations in artificial systems similarly to the human brain. By comparing a purely visual ResNet, a language-supervised Lang-Learned ResNet, and CLIP using an fMRI-inspired functional localizer, it shows that language learning increases the number of category-selective neurons but reduces single-neuron specificity and spatial localization, with effects replicable in CLIP. These results suggest language-driven semanticization of vision, shifting from local feature-based codes to distributed, semantically grounded representations. The findings provide a computational bridge between linguistic context and visual cognition, with implications for multimodal learning and brain-inspired AI design.

Abstract

Category-selective regions in the human brain, such as the fusiform face area (FFA), extrastriate body area (EBA), parahippocampal place area (PPA), and visual word form area (VWFA), support high-level visual recognition. Here, we investigate whether artificial neural networks (ANNs) exhibit analogous category-selective neurons and how these representations are shaped by language experience. Using an fMRI-inspired functional localizer approach, we identified face-, body-, place-, and word-selective neurons in deep networks presented with category images and scrambled controls. Both the purely visual ResNet and a linguistically supervised Lang-Learned ResNet contained category-selective neurons that increased in proportion across layers. However, compared to the vision-only model, the Lang-Learned ResNet showed a greater number but lower specificity of category-selective neurons, along with reduced spatial localization and attenuated activation strength, indicating a shift toward more distributed, semantically aligned coding. These effects were replicated in the large-scale vision-language model CLIP. Together, our findings reveal that language experience systematically reorganizes visual category representations in ANNs, providing a computational parallel to how linguistic context may shape categorical organization in the human brain.
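To make the localizer idea concrete, here is a minimal sketch of how category-selective units might be flagged from recorded activations. It is not the authors' pipeline: the Welch t-statistic, the fixed threshold `t_thresh`, the helper names `welch_t` and `selective_neurons`, and the synthetic data are all assumptions chosen for illustration. The only part taken from the text is the criterion itself: a unit counts as selective for a category if it responds more strongly to that category than to every other one, scrambled controls included.

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic per neuron for two (n_images, n_neurons) arrays."""
    va = a.var(axis=0, ddof=1) / a.shape[0]
    vb = b.var(axis=0, ddof=1) / b.shape[0]
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va + vb + 1e-12)

def selective_neurons(acts, labels, target, t_thresh=2.0):
    """Boolean mask of neurons that respond more to `target` than to every
    other category. `t_thresh` is an illustrative cutoff (an assumption),
    not a value reported in the paper."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    mask = np.ones(acts.shape[1], dtype=bool)
    for other in np.unique(labels):
        if other == target:
            continue
        t = welch_t(acts[labels == target], acts[labels == other])
        mask &= t > t_thresh  # must beat each other category separately
    return mask

# Synthetic stand-in for one layer's activations: 5 categories x 40 images,
# 50 neurons. Neurons 0-4 are constructed to prefer faces.
rng = np.random.default_rng(0)
categories = ["face", "body", "place", "word", "scrambled"]
labels = np.repeat(categories, 40)
acts = rng.normal(0.0, 1.0, size=(labels.size, 50))
acts[labels == "face", :5] += 3.0

face_mask = selective_neurons(acts, labels, "face")
print("proportion of face-selective neurons:", face_mask.mean())
```

Running the same function for each category and each layer would give per-layer proportions of the kind the abstract describes. In practice a proper significance test with multiple-comparison correction would replace the fixed threshold.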

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Lang-Learned ResNet and stimuli in our study. (A) We fine-tuned a pretrained ResNet on ImageNet, adding a language embedding generation task to the original model architecture, to obtain the Lang-Learned ResNet. (B) Image stimuli used in the functional localizer experiment. Images were divided into five categories: face, body, place, word, and scrambled images.
  • Figure 2: Category-selective activations in ResNet and Lang-Learned ResNet. (A) Response profiles of category-selective neurons from "layer4.2.relu". The top row shows results for ResNet, and the bottom row for Lang-Learned ResNet. Each individual dot corresponds to an image. (B) Category selectivity index (CSI) across layers. From left to right, each panel shows the proportion of neurons selective for faces (blue), bodies (orange), places (green), and words (red) at each layer of ResNet (solid lines) and Lang-Learned ResNet (dashed lines). Asterisk indicates CSI of ResNet is significantly greater than CSI of Lang-Learned ResNet, $p<.05$.
  • Figure 3: Proportion of category-selective neurons in ResNet and Lang-Learned ResNet. (A) Proportion of category-selective neurons in "layer4.2.relu". Neurons were classified as category-selective if they exhibited significantly higher activation for one category compared to all others. (B) Proportion of category-selective neurons across layers.
  • Figure 4: Spatial distribution of category-selective neurons in ResNet and Lang-Learned ResNet. (A) Feature map-wise distribution of category-selective neurons. Each heatmap represents the spatial distribution of face-, body-, place-, and word-selective neurons across the feature map in "layer4.2.relu" for ResNet (top row) and Lang-Learned ResNet (bottom row). The color scale indicates the proportion of neurons at each spatial location in the feature map. (B) Quantification of feature map-wise variation in category-selective neuron distribution. (C) Feature map-wise variation in category-selective neurons across layers.
  • Figure 5: Category-selectivity between ResNet and CLIP. (A) Category selectivity index (CSI) across layers. Asterisk indicates CSI of ResNet is significantly greater than CSI of CLIP, $p<.05$. (B) Proportion of category-selective neurons across layers. (C) Feature map-wise variation in category-selective neurons across layers. From left to right, each panel shows the proportion of neurons selective for faces (blue), bodies (orange), places (green), and words (red) at each layer of ResNet (solid lines) and Lang-Learned ResNet (dashed lines).
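
The spatial analysis described in the Figure 4 caption can be illustrated with a short sketch. The caption states only that the heatmaps show the proportion of selective neurons at each spatial location and that a "feature map-wise variation" is quantified. The exact variation metric is not given here, so the standard deviation across locations used below is an assumption, as are the function names `spatial_proportion` and `spatial_variation` and the toy masks.

```python
import numpy as np

def spatial_proportion(mask):
    """mask: boolean array (channels, height, width) marking selective units.
    Returns the (height, width) proportion of selective units per location."""
    return np.asarray(mask, dtype=float).mean(axis=0)

def spatial_variation(mask):
    """Standard deviation of the per-location proportion (assumed metric):
    0 when selective units are spread evenly, larger when they cluster."""
    return float(spatial_proportion(mask).std())

# Two toy layers with the same number of selective units (32 of 128).
uniform = np.zeros((8, 4, 4), dtype=bool)
uniform[:2] = True            # 2 of 8 channels selective at every location

clustered = np.zeros((8, 4, 4), dtype=bool)
clustered[:, :2, :2] = True   # all channels selective in one 2x2 corner

print("uniform variation:  ", spatial_variation(uniform))
print("clustered variation:", spatial_variation(clustered))
```

Under this assumed metric, a lower value corresponds to the reduced spatial localization reported for the language-trained models, since selective units are spread more evenly over the feature map.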