Layerwise complexity-matched learning yields an improved model of cortical area V2

Nikhil Parthasarathy, Olivier J. Hénaff, Eero P. Simoncelli

TL;DR

This work shows that the layerwise complexity-matched learning (LCL) formulation produces a two-stage model (LCL-V2) that is better aligned with selectivity properties and neural activity in primate area V2, and that the complexity-matched learning paradigm accounts for much of this improved biological alignment.

Abstract

Human ability to recognize complex visual patterns arises through transformations performed by successive areas in the ventral visual cortex. Deep neural networks trained end-to-end for object recognition approach human capabilities, and offer the best descriptions to date of neural responses in the late stages of the hierarchy. But these networks provide a poor account of the early stages, compared to traditional hand-engineered models, or models optimized for coding efficiency or prediction. Moreover, the gradient backpropagation used in end-to-end learning is generally considered to be biologically implausible. Here, we overcome both of these limitations by developing a bottom-up self-supervised training methodology that operates independently on successive layers. Specifically, we maximize feature similarity between pairs of locally-deformed natural image patches, while decorrelating features across patches sampled from other images. Crucially, the deformation amplitudes are adjusted proportionally to receptive field sizes in each layer, thus matching the task complexity to the capacity at each stage of processing. In comparison with architecture-matched versions of previous models, we demonstrate that our layerwise complexity-matched learning (LCL) formulation produces a two-stage model (LCL-V2) that is better aligned with selectivity properties and neural activity in primate area V2. We demonstrate that the complexity-matched learning paradigm is responsible for much of the emergence of the improved biological alignment. Finally, when the two-stage model is used as a fixed front-end for a deep network trained to perform object recognition, the resultant model (LCL-V2Net) is significantly better than standard end-to-end self-supervised, supervised, and adversarially-trained models in terms of generalization to out-of-distribution tasks and alignment with human behavior.
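
The per-layer objective described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard Barlow Twins form of the loss (the loss named in the Figure 3 caption); the paper's exact formulation may differ. The function names, the off-diagonal weight `off_diag_weight`, and the `fraction` used to scale deformations with receptive field size are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-12):
    """Barlow Twins-style loss for two batches of embeddings of shape (N, D).

    z_a and z_b hold the features of two deformed views of the same N patches.
    The off_diag_weight default is an assumption for illustration.
    """
    # Standardize every feature dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    n = z_a.shape[0]
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix between the views

    # Diagonal term: matching features of the two views should correlate at 1.
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    # Off-diagonal term: different feature dimensions should be decorrelated.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return float(on_diag + off_diag_weight * off_diag)


def deformation_amplitude(receptive_field_px, fraction=0.25):
    """Hypothetical rule: deformation size proportional to receptive field size."""
    return fraction * receptive_field_px


rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
w = rng.normal(size=(256, 8))

loss_same = barlow_twins_loss(z, z)  # identical views: diagonal term vanishes
loss_diff = barlow_twins_loss(z, w)  # unrelated views: diagonal term is large
```

In a layerwise scheme, a loss of this kind would be evaluated separately on each stage's output, with each stage seeing views whose deformation amplitude is set by a rule like `deformation_amplitude` for that stage's receptive field.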

Paper Structure (35 sections, 3 equations, 13 figures, 5 tables)


Figures (13)

  • Figure 1: DNN object recognition performance predicts human recognition behavior, but not primate early visual responses. Each plotted point corresponds to a DNN model from the BrainScore database (schrimpf2018brain). Horizontal axis of both panels indicates recognition accuracy (top-1) on the ImageNet dataset (krizhevsky2012imagenet). Left: Comparison to alignment with human visual recognition performance (combination of benchmarks taken from geirhos2021partial and rajalingham2018large). Right: Comparison to neural variance explained by regressing the best-fitting DNN layer to neural responses measured in macaque V1 (green), V2 (blue) (freeman2013functional, ziemba2016selectivity) and IT (black) (majaj2015simple, SanghaviDiCarlo2021, SanghaviMurtyDiCarlo2021, SanghaviJozwikDiCarlo2021).
  • Figure 2: Layerwise complexity-matched learning. Top: The standard end-to-end (E2E) learning paradigm used with DNNs. The loss function ($L_{E2E}$) operates on the network output and is typically chosen to favor object-level invariances, through supervised training on labelled data or self-supervised training on augmented examples. To solve these E2E objectives, the network $f(\theta)$ must have high model capacity (a sufficiently large number of parameters and non-linearities). Bottom: In a layerwise training system, the loss is a function of all intermediate outputs ($z_1$, $z_2$, ...). The loss $L_l$ at each layer is used to train the corresponding encoder stage $f_{\theta_l}$ independently, with gradients operating only within stages. For effective training, we hypothesize that the loss at each stage, $L_l$, should be matched in complexity to the model capacity defined by the network up to layer $l$.
  • Figure 3: Layerwise complexity-matched objective. Left: For each layer, the objective encourages invariance to feature perturbations by comparing the representation of two augmented views of the same image. For layer $l$, the feature complexity of the generated image pair ($x^A_l$, $x^B_l$) is controlled through the choice of patch size and the magnitude of spatial deformations (translation, dilation). Right: The parameters $\theta_1$ of the first-layer encoder $f_{\theta_1}$ are updated using the Barlow Twins feature-contrastive loss (zbontar2021barlow) operating on the two views of the smallest patch size ($x^{A}_1$, $x^{B}_1$). This set of views is propagated only as far as this layer's output. The parameters $\theta_2$ of the second-layer encoder $f_{\theta_2}$ are updated with the same loss, but using views that cover a larger spatial region and include larger spatial deformations.
  • Figure 4: The LCL-V2 model outperforms other models in accounting for V2 responses. Left: Median explained variance of models fitted with PLS regression to 103 primate V2 neural responses. For models with more than two layers, all layers are evaluated and the performance of the best layer is provided. Standard deviations over 10-fold cross-validated regression are indicated on each bar. Right: Comparison of median explained variance for "V1-like" and "V2-like" V2 cells. These categories correspond to the top and bottom quartiles (N=26) of V2 cells sorted by how well they are fit by a canonical hand-constructed V1 model (V1-SteerPyr+Pool). The minimum explained variance of the V1 model over the set of "V1-like" neurons is 57%. The maximum explained variance of the V1 model over the set of "V2-like" neurons is 4%. The LCL-V2 and L2-AT models significantly outperform all other models on the V1-like subset, even surpassing the baseline V1 model. The LCL-V2 model also significantly outperforms the L2-AT model on the V2-like subset.
  • Figure 5: The LCL-V2 model outperforms other models in capturing texture modulation properties of V2 neurons. We compare the top 3 fully-learned models (ranked by overall V2 predictivity): LCL-V2 (Ours), L2-AT, and Supervised. Top: Quantile-quantile (Q-Q) comparison of the distribution of texture modulation index values ($R_{mod}$, averaged over texture families) for real and model neurons. LCL-V2 shows better alignment with the physiological distribution (closer to the dashed identity line) than the other two models. Bottom: Comparison of texture modulation indices for each of 15 texture families (averaged over neurons). The modulation indices for model and real neurons are ranked (1 = lowest modulation family, 15 = highest modulation family), and plotted against each other. Our model provides better alignment with the V2 data, achieving a Spearman rank correlation of $\rho = 0.9$. P-values were computed to test the significance of the difference between the Spearman correlations using the methodology described in Sec. \ref{sec:texmod_appendix}. We find the difference to be significant versus both models (L2-AT: p = 0.040; Supervised: p = 0.047).
  • ...and 8 more figures
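
The rank comparison described in the Figure 5 caption (ranking texture families by modulation index for model and real neurons, then computing a Spearman correlation) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the input values are made-up placeholders rather than data from the paper, and ties are assumed not to occur.

```python
import numpy as np


def ranks(values):
    """Rank values from 1 (lowest) to n (highest). Assumes no ties."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


# Placeholder per-family modulation indices (not values from the paper).
model_index = [0.05, 0.20, 0.10, 0.40, 0.30]
neural_index = [0.02, 0.25, 0.15, 0.35, 0.50]

rho = spearman_rho(model_index, neural_index)
```

A value of `rho` near 1 means the model orders the texture families by modulation strength in nearly the same way as the recorded neurons, regardless of the absolute index values.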