Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

Soh Takahashi, Masaru Sasaki, Ken Takeda, Masafumi Oizumi

TL;DR

The paper asks whether deep neural networks mirror human object similarity structure at a fine-grained (individual object) level or only at a coarse-grained (category) level. It uses an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport (GWOT) to map objects between human and model representations. Human embeddings (SPoSE) derived from THINGS are compared with embeddings from diverse DNNs, with RSA as a baseline and GWOT revealing object-level correspondences. The key finding is that CLIP models achieve strong fine- and coarse-grained matching, pointing to the importance of linguistic information, whereas image-only self-supervised models show limited matching at both levels yet still form object clusters that reflect human coarse category structure. The work contributes a framework for dissociating granularity in human-model representational similarity and highlights language grounding as a driver of precise object representations, with implications for modeling early, prelinguistic categorization through self-supervised learning.

Abstract

The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they acquire internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms (such as supervised, self-supervised, and CLIP) acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport (GWOT) to compare human and model object representations at both fine-grained and coarse-grained levels. Unlike conventional representational similarity analysis (RSA), this method estimates an optimal fine-grained mapping between the representation of each object in humans and the representations of objects in models. We use this method to assess the extent to which the representation of each object in humans is correctly mapped to the representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models show limited matching at both fine- and coarse-grained levels, but still form object clusters that reflect human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
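
The fine-grained evaluation described in the abstract can be illustrated with a small sketch. It assumes a top-k definition of the matching rate: a human object counts as matched when the same object is among the k model objects that receive the most transport mass from it. The paper's exact definition may differ. The function name `matching_rate` and the toy 4-object plan are hypothetical, and the toy matrix is hand-written rather than an actual GWOT solution.

```python
def matching_rate(plan, k=1):
    """Percentage of human objects (rows) whose own index is among the
    k model objects (columns) with the largest transport probability.

    Assumes rows and columns are listed in the same object order, so a
    correct mapping lies on the diagonal of the plan.
    """
    hits = 0
    for i, row in enumerate(plan):
        # Column indices sorted by decreasing transport probability.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(plan)


# Hypothetical transport plan: rows are human objects, columns are model objects.
toy_plan = [
    [0.20, 0.03, 0.01, 0.01],
    [0.02, 0.18, 0.04, 0.01],
    [0.01, 0.15, 0.05, 0.04],  # object 2 sends most mass to object 1: a top-1 miss
    [0.00, 0.02, 0.03, 0.20],
]

print(matching_rate(toy_plan))       # 75.0
print(matching_rate(toy_plan, k=2))  # 100.0
```

A plan with a prominent diagonal yields a high matching rate, while a plan whose mass is spread away from the diagonal yields a rate near chance.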

Paper Structure

This paper contains 29 sections, 12 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of alignment methods and experimental workflow. a, b. Schematic of human and model representational dissimilarity matrices (RDMs). c, d. Schematic of object representation mapping between human and model. e. Workflow: Human RDMs were derived from behavioral odd-one-out tasks, while model RDMs were computed from DNN final layer embeddings. Unsupervised alignment was performed using GWOT and evaluated by the matching rate metric.
  • Figure 2: Supervised comparison between representational structures of humans and DNNs. Pearson correlation coefficients between human and model RDMs based on conventional Representational Similarity Analysis (RSA). The results for the untrained models are averaged over 10 random initializations, with error bars representing the 5th and 95th percentiles of these trials.
  • Figure 3: Representative Example of Unsupervised Alignment Results Between Human and Model. Representational Dissimilarity Matrix (RDM) of a. human, b. CLIP ViT/B16 datacomp large, and c. SimCLRv2 ResNet50 IN1k. Objects in RDMs are sorted by coarse categories. d. Optimal transportation plan between human and CLIP ViT/B16 datacomp large. Rows represent human objects, and columns represent model objects. Objects are sorted by coarse categories. The values indicate transportation probabilities, with prominent diagonal patterns reflecting strong alignment (surrounded by a red dashed line). e. Optimal transportation plan between human and SimCLRv2 ResNet50 IN1k. Unlike d, the values of the diagonal components are small. f. Examples of top 5 objects with the highest transportation probability between human and CLIP ViT/B16 datacomp large. g. Examples of top 5 objects with the highest transportation probability between human and SimCLRv2 ResNet50 IN1k.
  • Figure 4: Fine-grained matching of representational structures between humans and DNNs. a. Matching rates between human and model object representations for the minimum GWD solution based on GWOT. This solution is indicated by the blue circle in panel c. The results for the untrained model are averaged over 10 random initializations, with error bars representing the 5th and 95th percentiles of these trials. Simulation results for the chance level are also shown as error bars (5th and 95th percentiles), but both error bars are too small to be visible in this plot. b. Matching rates between human and model object representations for the highest matching rate solution based on GWOT. This solution is indicated by the red circle in panel c. Similar to panel a, the untrained model results are averaged over 10 random initializations, with error bars representing the 5th and 95th percentiles. Chance level simulation results are also shown as error bars (5th and 95th percentiles), but both error bars are too small to be visible in this plot. c. A representative example of local optima from GWOT optimization (human vs. CLIP ResNet50 OpenAI). Each point represents a transportation plan obtained from the optimization results.
  • Figure 5: Coarse-grained matching of representational structures between humans and DNNs. a. Category matching rates between human and model object representations for the minimum GWD solution based on GWOT. Results for the untrained model are averaged over 10 random initializations, with error bars representing the 5th and 95th percentiles of these trials. Simulation results for the chance level are also shown as error bars (5th and 95th percentiles). b. Category matching rates between human and model object representations for the highest category matching rate solution based on GWOT. Similar to panel a, the untrained model results are averaged over 10 random initializations, with error bars representing the 5th and 95th percentiles, and the chance level simulation results are also shown as error bars (5th and 95th percentiles).
  • ...and 1 more figure
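
The figure captions above refer to two further quantities: the RSA correlation between RDMs (Figure 2) and the category matching rate (Figure 5). The sketch below illustrates both under stated assumptions. RSA is taken as the Pearson correlation over the upper-triangular RDM entries. The category matching rate is assumed to count a human object as matched when the model object receiving the most transport mass belongs to the same coarse category. All function names, the toy RDMs, the toy plan, and the category labels are hypothetical, and the paper's exact definitions may differ.

```python
import math


def upper_triangle(rdm):
    """Off-diagonal upper-triangular entries of a square RDM, row by row."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rsa_correlation(rdm_a, rdm_b):
    """RSA score: correlation between the dissimilarities of two RDMs.

    Requires both RDMs to list objects in the same, known order, which is
    why the comparison is called supervised.
    """
    return pearson(upper_triangle(rdm_a), upper_triangle(rdm_b))


def category_matching_rate(plan, categories):
    """Percentage of rows whose highest-probability column shares the
    row object's coarse category label."""
    hits = 0
    for i, row in enumerate(plan):
        best = max(range(len(row)), key=lambda j: row[j])
        if categories[best] == categories[i]:
            hits += 1
    return 100.0 * hits / len(plan)


# Hypothetical 4-object RDMs; model_rdm is a linear rescaling of human_rdm,
# so the RSA correlation is 1 up to floating-point error.
human_rdm = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
model_rdm = [[0, 3, 9, 11], [3, 0, 7, 9], [9, 7, 0, 5], [11, 9, 5, 0]]
print(rsa_correlation(human_rdm, model_rdm))

# Hypothetical plan in which objects 0 and 1 are swapped: no top-1 object
# match for those two, but every object still lands in its own category.
swapped_plan = [
    [0.05, 0.20, 0.00, 0.00],
    [0.20, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.20, 0.05],
    [0.00, 0.00, 0.05, 0.20],
]
labels = ["animal", "animal", "food", "food"]
print(category_matching_rate(swapped_plan, labels))  # 100.0
```

The swapped plan shows how coarse-grained matching can succeed where fine-grained matching fails, which is the kind of dissociation the two metrics are meant to expose.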