Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations

Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo

TL;DR

This work asks whether the primate ventral stream is optimized solely for object categorization or also for spatial latent estimation. The authors test this by training CNNs on synthetic TDW datasets to predict spatial latents: translation ($X$, $Y$), distance ($Z$), and rotation ($R^{xy}$, $R^{yz}$, $R^{zx}$). Using Brain-Score benchmarks and Centered Kernel Alignment (CKA), they show that CNNs trained to estimate a small number of spatial latents achieve neural alignment comparable to categorization-trained models, and that spatial-latent and category-trained models share similar representations in early and middle layers. They further show that non-target latent variability in the training data helps models learn representations of the joint latents, which partially explains the convergence across training objectives. The findings suggest that ventral-stream representations are not limited to categorization and can support multiple visual inferences. They also point to the need for more sensitive brain-model comparison measures and for a broader view of ventral-stream function, studied with synthetic-data methods.

Abstract

Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.
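
The representational comparison summarized above relies on Centered Kernel Alignment (CKA). As a rough illustration of how such a similarity score can be computed, the sketch below implements the linear variant of CKA with NumPy. The function name `linear_cka`, the synthetic activation matrices, and the choice of the linear kernel are assumptions made for this example; the text here does not state which CKA variant or preprocessing the authors used.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    Both inputs have shape (n_samples, n_features); the feature
    dimensions may differ, but the rows must be the same stimuli.
    """
    # Center every feature so the implied Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical activations standing in for one layer of two models.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 32))

# A rotated and rescaled copy of acts_a: linear CKA is invariant to
# orthogonal transforms and isotropic scaling, so the score should be 1.
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
acts_b = 3.0 * acts_a @ q

# Unrelated random activations: the score should be much lower.
acts_c = rng.normal(size=(200, 16))

cka_same = linear_cka(acts_a, acts_b)
cka_indep = linear_cka(acts_a, acts_c)
print(f"rotated copy: {cka_same:.3f}, independent: {cka_indep:.3f}")
```

In practice the rows would be a model's responses to a shared set of images at a given layer, flattened to one feature vector per image. Averaging the score over pairs of models from two training groups would give one entry of a pair-wise similarity matrix.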

Paper Structure

This paper contains 20 sections, 13 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Spatial latent variables and our training dataset. (a) An illustration of the set of spatial latents that are available in the dataset for training models. In addition to the object category (Apple), the set of spatial latent variables we record are the following: translation ($X$, $Y$), distance ($Z$), and rotation ($R^{xy}$, $R^{yz}$, $R^{zx}$). (This image is for illustration only, not in the synthetic dataset.) (b) Example images in our dataset (TDW-117) for training CNNs. Each image contains one object with varying positions and poses against a random background.
  • Figure 2: Learning a small number of spatial latents produces ventral-stream-aligned CNN models. (a) The neural alignment of models (ResNet-50) trained on different objectives (x-axis, number of output units, see panel b). Learning a few spatial latent variables produced models that have neural alignment scores comparable to models trained on hundreds of categories. Error bars or shaded regions show the SD across multiple random seeds (N=5). All spatial + classification = all spatial and all classification tasks combined. TDW-117 is our TDW dataset with 117 object categories; TDW-N means datasets with N = 2, 4, 6, 8, 16 categories. For a breakdown of the individual region alignment scores, see \ref{fig:suppfig_resnet50brainscore}. (b) The training tasks we investigated and their corresponding number of output units that receive supervision during training.
  • Figure 3: The neural alignment of models correlates strongly with their spatial task performance. (a) The neural alignment scores correlate with models' categorization performance for models trained on object categories. This figure shows results from multiple random initializations. Each dot shows a ResNet-50 model colored by the number of training batches. (b-d) Models' neural alignment scores correlate with their spatial latent estimation performance when they are trained to estimate those latents respectively. The figures show models trained on (b) distance regression, (c) translation regression, (d) rotation regression (1 outlier out of 60 data points, where the loss is larger than 0.2, is excluded). For a breakdown of the averaged score into individual scores, see \ref{fig:suppfig_perf_bscore_breakdown} and \ref{tab:model_perf_bscore_corr}.
  • Figure 4: Models trained to estimate different latents learned similar representations. (a) Pair-wise similarity (CKA) between models trained on different targets at 4 different layers. Until the last few layers (e.g., layer4.0.relu), the representations of models trained on different spatial and category latents remain highly similar. Off-diagonal entries are the averaged pair-wise similarity between models in the two groups. Diagonal entries are averaged similarity between different randomly initialized models in the same group. (All spatial + cla. means models trained on all spatial and classification tasks.) (b) The similarity between category classification models and models trained on other targets. Models trained on spatial latents remain similar to category models until the last few layers. Layer names as in \ref{tab:resnet_18_archi}. intra - Obj. Category shows the averaged distance between categorization models trained with different random initializations.
  • Figure 5: Non-target latent variability helps learn better representations of the joint latents in intermediate layers. (a) We compared the decoding performance of non-target latents at different layers of the models. Using a model trained to estimate some target latents on a dataset that has reduced non-target latent variability as a baseline, we can expect two different outcomes. First (H1), if a model trained with full non-target latent variability decodes the non-target latents better, that indicates that the non-target latent variability facilitates the learning of that non-target latent, although the models are not trained to estimate them. Second (H2), it is also possible that additional non-target latent variability makes models become invariant to that non-target latent, and thus leads to worse decoding performance for that non-target latent. Our experiments suggested that models learned better representations of the non-target latent with additional non-target latent variability, supporting H1. (b) Decoding performance for the non-target latent -- category -- when the models are trained to estimate target latents -- distance (top), translation (middle), rotation (bottom). Non-target latent variability helped models learn better representations of these latents in the intermediate layers. (cat. var. -- category variability). (c) Similar results were seen when the non-target latent is translation, and the target latent is distance (top), rotation (middle), and category (bottom). When the target latent is category, models learn better representations of translation with additional translation variability, although they became more translation invariant in the last two layers. (tran. var. -- translation variability). The results for another translation latent, Y, are similar (\ref{fig:suppfig_decode_1}). (Error bars show the SD across 5 cross-validation runs $\times$ 6 randomly initialized models. "*" indicates a significant difference between the two groups, Mann-Whitney U test, p value $<$ 0.05). Layer names as in \ref{tab:resnet_18_archi}.
  • ...and 16 more figures
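
The layer-wise decoding analysis described in the Figure 5 caption can be approximated with a cross-validated linear readout. The sketch below is one possible version for a continuous latent such as translation. The ridge readout, the regularization strength `alpha`, the function name `ridge_decode_r2`, and the synthetic features are all assumptions for illustration; the caption specifies only that decoding was evaluated over 5 cross-validation runs, not which decoder was used.

```python
import numpy as np


def ridge_decode_r2(features, target, alpha=1.0, n_folds=5, seed=0):
    """Mean cross-validated R^2 of a ridge readout from features to target.

    features: (n_samples, n_features) activations from one layer.
    target:   (n_samples,) values of the latent to decode.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(target)), n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        x_tr, y_tr = features[train_idx], target[train_idx]
        x_te, y_te = features[test_idx], target[test_idx]
        # Standardize using training-fold statistics only.
        mu, sd = x_tr.mean(axis=0), x_tr.std(axis=0) + 1e-8
        x_tr, x_te = (x_tr - mu) / sd, (x_te - mu) / sd
        y_mu = y_tr.mean()
        # Closed-form ridge solution: (X^T X + alpha I) w = X^T (y - mean(y)).
        gram = x_tr.T @ x_tr + alpha * np.eye(x_tr.shape[1])
        w = np.linalg.solve(gram, x_tr.T @ (y_tr - y_mu))
        pred = x_te @ w + y_mu
        ss_res = np.sum((y_te - pred) ** 2)
        ss_tot = np.sum((y_te - y_te.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))


# Synthetic stand-ins: one "layer" that linearly encodes the latent,
# and one that carries no information about it.
rng = np.random.default_rng(1)
latent = rng.uniform(-1.0, 1.0, size=500)
mixing = rng.normal(size=(1, 20))
feats_informative = latent[:, None] @ mixing + 0.1 * rng.normal(size=(500, 20))
feats_uninformative = rng.normal(size=(500, 20))

r2_informative = ridge_decode_r2(feats_informative, latent)
r2_uninformative = ridge_decode_r2(feats_uninformative, latent)
print(f"informative: {r2_informative:.3f}, uninformative: {r2_uninformative:.3f}")
```

Under H1, a layer from a model trained with full non-target variability would be expected to score higher on this kind of readout than the same layer from the reduced-variability baseline; under H2 it would score lower. Decoding a categorical latent would need a classifier rather than a regression readout.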