Inhibitory normalization of error signals improves learning in neural circuits

Roy Henha Eyono, Daniel Levenstein, Arna Ghosh, Jonathan Cornford, Blake Richards

Abstract

Normalization is a critical operation in neural circuits. In the brain, there is evidence that normalization is implemented via inhibitory interneurons and allows neural populations to adjust to changes in the distribution of their inputs. In artificial neural networks (ANNs), normalization is used to improve learning in tasks that involve complex input distributions. However, it is unclear whether inhibition-mediated normalization in biological neural circuits also improves learning. Here, we explore this possibility using ANNs with separate excitatory and inhibitory populations trained on an image recognition task with variable luminosity. We find that inhibition-mediated normalization does not improve learning if normalization is applied only during inference. However, when this normalization is extended to include back-propagated errors, performance improves significantly. These results suggest that if inhibition-mediated normalization improves learning in the brain, it additionally requires the normalization of learning signals.

Paper Structure (30 sections, 46 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Schematic of the Excitatory-Inhibitory (EI) network with Layer Normalization and the perceptual invariance task. a: Feedforward EI network architecture. Gray units represent the Excitatory (E) population, and purple units represent the Inhibitory (I) population. All outgoing synaptic weights of a unit share the same sign ($\text{E}, +$ or $\text{I}, -$). b: To test perceptual invariance, a constant shift is applied to each individual image in the FashionMNIST dataset during both training and testing. For every image, a unique constant $\Delta$ is sampled subject to $|\Delta| < \epsilon$. The figure displays three example augmentations to illustrate how the shift varies across the allowable range.
  • Figure 2: Layer normalization (LN) improves perceptual invariance in Excitatory-Inhibitory (EI) networks. a: Test accuracy (Acc %) comparison of EI networks with LN (x-axis) to those without LN (y-axis). Data points represent performance across 30 hyperparameter combinations (layer widths and E/I learning rates) and four luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Points below the dashed diagonal line indicate cases where networks with LN performed better. b: Top-10 test accuracy comparison of an EI network with LN against an Excitatory-only (E-only) network, also with LN. The box plots summarize the distribution across the same 30 hyperparameter combinations reported in panel $\mathbf{a}$.
  • Figure 3: Inhibitory populations learn to implement layer normalization of excitatory activity. a: Schematic showing how the inhibitory circuit (purple) is trained locally via the $\mathcal{L}_{\text{I-Norm}}$ loss (purple lines) to normalize excitatory activity. Excitatory-to-excitatory weights are updated only by the task loss $\mathcal{L}_{\text{Task}}$ (dotted left arrow). Forward Pass: Inhibition performs subtractive ($-$) and divisive ($\div$) modulation. Backward Pass: Inhibitory gradients enforce layer-normalized excitatory statistics. b: Box-and-whisker plots of the first and second moments of excitatory activations. Each plot compares three conditions: No-Norm, Subtractive-only I-Norm (sub), and I-Norm (as depicted in a). Each box plot summarizes model results aggregated across the sampled range of $\epsilon$ luminosity augmentations.
  • Figure 4: I-Norm networks struggle to recapitulate the performance of LN in EI networks. a: Test accuracy comparison between LN (x-axis) and I-Norm (y-axis) across hyperparameter and luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Each point represents a single network training run. Points falling below the dashed diagonal line indicate cases where LN achieved higher test accuracy than I-Norm. b: Layer-wise alignment between I-Norm and LN. We quantify alignment as the cosine similarity between I-Norm and LayerNorm (LN) across all network layers for outputs (top) and gradients (bottom). All results correspond to the highest luminosity range ($\epsilon = 0.75$).
  • Figure 5: Hard-coded LN gradients in I-Norm networks restore LN performance in EI networks. a: Schematic illustrating the Backward Pass of the I-Norm network incorporating the GradNorm operation from equation \ref{eqn: ln_der}. The $\text{GradNorm}$ operation is applied to the backward signal ($\delta$) to enforce the LN gradient. b: Average cosine similarity across all network layers. The top and bottom panels show the similarity between LN and $\text{I-Norm (with GradNorm)}$ for outputs and gradients, respectively. c: Test accuracy (Acc %) comparison. LN network performance (x-axis) versus I-Norm network with $\text{GradNorm}$ (y-axis). Data are shown across 30 hyperparameter initializations and four luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Points clustered along the dashed diagonal line indicate a strong match in performance between the two models.
  • ...and 2 more figures
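
The captions above refer to three operations that may be easier to follow in code: the per-image luminosity shift (Figure 1b), layer normalization and the gradient it induces on the backward signal $\delta$ (Figures 2 and 5), and cosine similarity as an alignment measure (Figure 4b). The NumPy sketch below is illustrative only and is not the authors' implementation. All function names are hypothetical, uniform sampling of $\Delta$ is an assumption (the caption states only that $|\Delta| < \epsilon$), pixel values are not clipped after the shift, and `layer_norm_backward` is the standard LN gradient without learnable gain or bias, which may differ in detail from the paper's GradNorm equation.

```python
import numpy as np


def shift_luminosity(image, eps, rng):
    """Add one constant Delta to every pixel of a single image.

    Assumption: Delta is drawn uniformly from [-eps, eps).
    """
    delta = rng.uniform(-eps, eps)
    return image + delta


def layer_norm(x, tiny=1e-5):
    """Standard layer normalization over the last axis (no gain or bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + tiny)


def layer_norm_backward(x, delta, tiny=1e-5):
    """Map the backward signal delta = dL/dy to dL/dx for y = layer_norm(x).

    dL/dx = (delta - mean(delta) - y * mean(delta * y)) / sigma
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + tiny)
    y = (x - mu) / sigma
    centered = delta - delta.mean(axis=-1, keepdims=True)
    projected = y * (delta * y).mean(axis=-1, keepdims=True)
    return (centered - projected) / sigma


def cosine_similarity(a, b):
    """Cosine similarity between two arrays, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activity = rng.random((4, 16))        # 4 inputs, 16 units (toy sizes)
    shifted = activity + 0.5              # a constant offset per input

    # LN output is unchanged by a constant offset added to every unit.
    print(np.allclose(layer_norm(activity), layer_norm(shifted), atol=1e-6))

    # Alignment between the raw backward signal and the LN-shaped one.
    backward_signal = rng.standard_normal(activity.shape)
    shaped = layer_norm_backward(activity, backward_signal)
    print(cosine_similarity(backward_signal, shaped))
```

The sketch removes a constant offset exactly only when the offset is applied at the normalized layer itself. An offset applied to the input image passes through weights and nonlinearities first, so invariance at deeper layers is not guaranteed by this property alone.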