Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations

Yudi Xie; Weichen Huang; Esther Alter; Jeremy Schwartz; Joshua B. Tenenbaum; James J. DiCarlo

TL;DR

This work asks whether the primate ventral stream is optimized solely for object categorization or also for estimating spatial latents. To test this, the authors train CNNs on synthetic TDW datasets to predict spatial latents such as $X$, $Y$, $Z$, $R^{xy}$, $R^{yz}$, and $R^{zx}$. Using Brain-Score benchmarks and Centered Kernel Alignment (CKA), they show that CNNs trained to estimate a small number of spatial latents achieve neural alignment comparable to that of categorization-trained models, and that spatial-latent-trained and category-trained models share similar representations in their early and middle layers. They further show that non-target latent variability in the training data helps models learn representations of those non-target latents, which partially explains the convergence across training objectives. The findings suggest that ventral-stream representations are not limited to categorization and can support multiple visual inferences. They also point to the need for more sensitive brain-model comparison measures and for a broader view of ventral-stream function, investigated with synthetic-data methods.
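
For readers unfamiliar with Centered Kernel Alignment, the following is a minimal sketch of how a CKA similarity between two layers' activations can be computed. It assumes the linear variant of CKA (the text above does not say which kernel the authors use), and the function names and toy inputs are hypothetical illustrations, not the authors' code. Each input is a list of rows, one row per stimulus and one column per unit.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every unit has zero mean over stimuli."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two row-aligned matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between activation matrices x (n x p) and y (n x q).

    Assumes neither representation is constant across stimuli;
    otherwise the denominator is zero.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(xc, yc)
    denominator = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return numerator / denominator


# Hypothetical toy activations: 4 stimuli, 2 units.
acts = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[-b, a] for a, b in acts]  # same geometry, rotated by 90 degrees
print(linear_cka(acts, acts), linear_cka(acts, rotated))
```

Linear CKA is invariant to rotations and uniform rescaling of the unit axes, so a representation and its rotated copy score as identical. That property is what makes it suitable for comparing layers of separately trained networks, whose units have no natural correspondence.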

Abstract

Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.
