On How Iterative Magnitude Pruning Discovers Local Receptive Fields in Fully Connected Neural Networks

William T. Redman, Zhangyang Wang, Alessandro Ingrosso, Sebastian Goldt

TL;DR

The paper addresses why iterative magnitude pruning (IMP) discovers local receptive fields (RFs) in fully connected networks. It tests the hypothesis that IMP amplifies the non-Gaussian statistics of network representations, measured by preactivation kurtosis, creating a feedback loop that localizes features. The evidence comes from a cavity-score analysis of individual weight removals and from experiments with Gaussian clones of the data, which lack higher-order cumulants. The key findings are that non-Gaussian statistics are necessary for localization, that IMP increases preactivation kurtosis more than oneshot pruning, and that the pruning order systematically favors removals that most increase non-Gaussianity. This provides a parsimonious mechanism for IMP's inductive biases and offers tools, like the cavity method, to analyze and potentially optimize sparse subnetworks across architectures.
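To make the idea of a Gaussian clone concrete, here is a minimal sketch, assuming the clone is built by sampling from a multivariate Gaussian whose mean and covariance match the data. The helper names `gaussian_clone` and `excess_kurtosis` and the toy Laplace data are hypothetical; the paper's own construction of its Gaussian dataset may differ (for example, it may be done per class).

```python
import numpy as np

def gaussian_clone(x, n_samples, rng):
    """Draw samples from a Gaussian with the same mean and covariance as x.

    x has shape (n_points, n_features). Hypothetical helper, not the paper's code.
    """
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

def excess_kurtosis(z):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    c = z - z.mean()
    return float(np.mean(c**4) / np.mean(c**2) ** 2 - 3.0)

rng = np.random.default_rng(0)
# Toy stand-in for image data: correlated, heavy-tailed (Laplace) features.
mixing = np.array([[1.0, 0.5], [0.0, 1.0]])
source = rng.laplace(size=(20000, 2)) @ mixing
clone = gaussian_clone(source, 20000, rng)

# The clone keeps the covariance but loses the heavy tails.
print("source excess kurtosis:", excess_kurtosis(source[:, 0]))
print("clone excess kurtosis: ", excess_kurtosis(clone[:, 0]))
```

The clone preserves first- and second-order statistics, so any difference in what a network learns from it can be attributed to the missing higher-order structure.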

Abstract

Since its use in the Lottery Ticket Hypothesis, iterative magnitude pruning (IMP) has become a popular method for extracting sparse subnetworks that can be trained to high performance. Despite its success, the mechanism that drives the success of IMP remains unclear. One possibility is that IMP is capable of extracting subnetworks with good inductive biases that facilitate performance. Supporting this idea, recent work showed that applying IMP to fully connected neural networks (FCNs) leads to the emergence of local receptive fields (RFs), a feature of mammalian visual cortex and convolutional neural networks that facilitates image processing. However, it remains unclear why IMP would uncover localized features in the first place. Inspired by results showing that training on synthetic images with highly non-Gaussian statistics (e.g., sharp edges) is sufficient to drive the emergence of local RFs in FCNs, we hypothesize that IMP iteratively increases the non-Gaussian statistics of FCN representations, creating a feedback loop that enhances localization. Here, we demonstrate first that non-Gaussian input statistics are indeed necessary for IMP to discover localized RFs. We then develop a new method for measuring the effect of individual weights on the statistics of the FCN representations ("cavity method"), which allows us to show that IMP systematically increases the non-Gaussianity of pre-activations, leading to the formation of localized RFs. Our work, which is the first to study the effect of IMP on the statistics of the representations of neural networks, sheds parsimonious light on one way in which IMP can drive the formation of strong inductive biases.
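The cavity method described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: the score is taken here to be the change in preactivation kurtosis when a single weight is zeroed, with a positive score meaning the preactivations become less Gaussian (the sign convention given in the Figure 5 caption). The names `kurtosis` and `cavity_scores`, the absence of a bias term, and the toy inputs are all hypothetical.

```python
import numpy as np

def kurtosis(z):
    """Fourth standardized moment; equals 3 for a Gaussian."""
    c = z - z.mean()
    return float(np.mean(c**4) / np.mean(c**2) ** 2)

def cavity_scores(x, w):
    """Change in preactivation kurtosis when each weight of one unit is removed.

    x: inputs of shape (n_samples, n_inputs); w: weights of one hidden unit.
    Assumed convention: score > 0 means kurtosis rises after removal.
    """
    full = x @ w
    base = kurtosis(full)
    scores = np.zeros_like(w)
    for j in np.flatnonzero(w):
        # Removing w[j] subtracts its contribution from the preactivation.
        scores[j] = kurtosis(full - x[:, j] * w[j]) - base
    return scores

rng = np.random.default_rng(0)
# Toy inputs: feature 0 is Gaussian, feature 1 is heavy-tailed (Laplace).
x = np.column_stack([rng.normal(size=20000), rng.laplace(size=20000)])
w = np.array([1.0, 1.0])
scores = cavity_scores(x, w)
print(scores)
```

In this toy case, removing the weight on the Gaussian feature leaves a heavier-tailed preactivation (positive score), while removing the weight on the Laplace feature leaves a Gaussian one (negative score).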

Paper Structure

This paper contains 26 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: IMP discovers more localized RFs than oneshot magnitude pruning in FCNs. (A) Localized RFs are present after applying IMP for 10 rounds of pruning (each round pruning 30% of the remaining weights), leading to a subnetwork with sparsity $s = 97.2\%$. (B) Noisier, less localized RFs are present in the masks found after oneshot pruning FCNs trained on ImageNet32 to $s = 97.2\%$ sparsity. Pruned weights are shown in black and remaining weights are colored by the input channel (red, green, blue) they are connected to. The masks shown correspond to the 120 hidden units with the greatest number of weights remaining [pellegrini2022neural].
  • Figure 2: Non-Gaussian statistics contain local information in ImageNet32. (A) By maximizing the non-Gaussianity of a lower-dimensional representation of the 50,000 validation images from ImageNet32, ICA extracts features, some of which are localized. (B) In contrast, considering only the covariance of the validation images from ImageNet32 leads PCA to extract features that are periodic and thus non-local.
  • Figure 3: IMP does not discover local RFs when applied to FCNs trained on a Gaussian clone of ImageNet32. (A) Example images from ImageNet32 and ImageNet32-GP. Note the lack of sharp edges in the case of ImageNet32-GP. (B) Pruning masks found after 10 rounds of IMP. Compare these diffuse masks with the localized masks found on ImageNet32 (Fig. 1A). (C) Median RF width for masks found by IMP on ImageNet32 and ImageNet32-GP. The smaller the width, the more localized the mask. Error bars are the minimum and maximum of three independently trained and pruned FCNs.
  • Figure 4: IMP increases localization of RFs and preactivation kurtosis in FCNs trained on ImageNet32, to a greater extent than oneshot pruning. (A) Mean RF width, as a function of sparsity induced by IMP (black line), oneshot pruning (blue line), or random pruning (red line). (B) Mean kurtosis of preactivations, per class, as a function of sparsity induced by IMP, oneshot pruning, and random pruning. Note that a Gaussian distribution has kurtosis 3, so a kurtosis $>3$ indicates heavier-tailed, non-Gaussian statistics. In (A)-(B), solid lines show the mean and shaded areas the minimum and maximum of three independently trained and pruned FCNs.
  • Figure 5: IMP selectively prunes weights when their removal would most increase the non-Gaussianity of the preactivations. (A) A schematic overview of how the cavity score (defined in the paper's cavity-score equation) is computed. For a given unit in the first hidden layer, the kurtosis of its preactivation is computed (top). Then, the preactivation kurtosis is recomputed with a given weight $W_{ij}$ removed (bottom). If the distribution of preactivations becomes more Gaussian once $W_{ij}$ is removed, $\text{cavity}(W_{ij})<0$. If it instead becomes less Gaussian, $\text{cavity}(W_{ij})>0$. (B) Mean cavity score, computed at IMP round 0, for weights grouped by the IMP round in which they are ultimately removed. Note that weights removed later during IMP have negative cavity scores, while weights removed early have positive cavity scores. (C) Same as (B), but with the cavity score recomputed on the remaining weights, $\theta(t_\text{rewind}) \odot m(n - 1)$, after each round of IMP. The gray dashed line highlights that the mean cavity score of the weights removed at IMP round 8 stays negative until after the 7$^\text{th}$ round of pruning. In (B)-(C), solid lines denote the mean and shaded areas the minimum and maximum of three independently trained and pruned FCNs.
  • ...and 4 more figures
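
The sparsity quoted in the Figure 1 caption follows from the pruning schedule: removing 30% of the remaining weights for 10 rounds leaves $0.7^{10} \approx 2.8\%$ of the weights, i.e. about 97.2% sparsity. The sketch below checks that arithmetic and shows one magnitude-pruning step on a mask. The function names `sparsity_after` and `prune_smallest` and the random weights are hypothetical, and the retraining and rewinding that IMP performs between rounds is deliberately omitted, so this illustrates only the schedule, not IMP itself.

```python
import numpy as np

def sparsity_after(rounds, rate=0.30):
    """Fraction of weights removed after pruning `rate` of the survivors each round."""
    return 1.0 - (1.0 - rate) ** rounds

def prune_smallest(weights, mask, rate):
    """Return a new mask with the smallest-magnitude `rate` of surviving weights removed."""
    alive = np.flatnonzero(mask)
    n_prune = int(round(rate * alive.size))
    order = np.argsort(np.abs(weights.ravel()[alive]))
    new_mask = mask.ravel().copy()
    new_mask[alive[order[:n_prune]]] = False
    return new_mask.reshape(mask.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(100, 100))  # placeholder for trained weights
mask = np.ones_like(weights, dtype=bool)

for _ in range(10):
    # In IMP the network would be rewound and retrained here before pruning,
    # which is what distinguishes it from oneshot pruning; omitted in this sketch.
    mask = prune_smallest(weights, mask, rate=0.30)

print("schedule sparsity:", sparsity_after(10))
print("mask sparsity:    ", 1.0 - mask.mean())
```

Because the weights are held fixed here, the final mask is essentially the one oneshot pruning to the same sparsity would produce; any difference between the two methods comes from the training between rounds.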