Weight-Sharing Regularization

Mehran Shakerinava, Motahareh Sohrabi, Siamak Ravanbakhsh, Simon Lacoste-Julien

TL;DR

We study the proximal mapping of the weight-sharing regularizer $\mathcal{R}$ and interpret it as a physical system of interacting particles. Training with this regularizer enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, a setting in which convolutional neural networks fail.

Abstract

Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ time when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, while convolutional neural networks fail in this setting. Our code is available on GitHub.
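
To make the definition of $\mathcal{R}$ concrete, here is a minimal NumPy sketch that evaluates the penalty in two ways: directly from the pairwise sum, and in $O(d \log d)$ by sorting. The function names are hypothetical and are not taken from the authors' code. The sorted version relies on the identity $\sum_{i > j} |w_i - w_j| = \sum_{k=0}^{d-1} (2k - d + 1)\, w_{(k)}$, where $w_{(0)} \le \dots \le w_{(d-1)}$ are the sorted weights. Both functions assume $d \ge 2$.

```python
import numpy as np

def weight_sharing_penalty_naive(w):
    """Evaluate R(w) straight from the definition (O(d^2) time and memory)."""
    w = np.asarray(w, dtype=float)
    d = w.size
    # The full matrix of |w_i - w_j| counts every unordered pair twice.
    pairwise = np.abs(w[:, None] - w[None, :])
    return float(pairwise.sum() / 2.0 / (d - 1))

def weight_sharing_penalty_sorted(w):
    """Evaluate R(w) in O(d log d) using the sorted-weights identity."""
    w = np.sort(np.asarray(w, dtype=float))
    d = w.size
    # In ascending order, the k-th weight (0-indexed) appears with a plus sign
    # in k pairs and a minus sign in d - 1 - k pairs.
    coeffs = 2.0 * np.arange(d) - (d - 1)
    return float(coeffs @ w / (d - 1))
```

The penalty is zero exactly when all weights are equal (fully shared) and grows as the weights spread apart, which is why minimizing it encourages groups of weights to take identical values.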

Paper Structure

This paper contains 36 sections, 9 theorems, 46 equations, 9 figures, 4 tables, 5 algorithms.

Key Result

Lemma 2 (Weight-sharing)

The ODE preserves weight-sharing. Formally, for all $t$,

Figures (9)

  • Figure 1: Depiction of the unregularized error function's contour lines (in blue), together with the constraint area for $\mathcal{R} + \ell_1$, where the optimal parameter vector $w$ is marked by $w^\star$. Notice that the weights are set to be equal (shared) in the solution.
  • Figure 2: Sample digits from MNIST on a torus.
  • Figure 3: The emergence of learned convolution-like filters in a fully connected network with weight-sharing regularization on CIFAR10.
  • Figure 4: A random sample of learned filters from the 256 filters with the highest number of non-zero weights.
  • Figure 5: Test accuracy vs. dataset ratio on MNIST on torus for various models.
  • ...and 4 more figures

Theorems & Definitions (22)

  • Example 1
  • Lemma 2: Weight-sharing
  • Lemma 3: Monotonic inclusion
  • Theorem 4: Proximal mapping of $\mathcal{R}$
  • Proposition 5: Conservation of momentum
  • Proposition 6: Rightmost collision
  • Theorem 7
  • Definition 8: Subgradient
  • Definition 9: Subdifferential
  • Proposition 10: Subdifferential of $\max$
  • ...and 12 more
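
As an illustration of the proximal mapping named in the list above, the following is a minimal sequential sketch of one standard way to compute $\operatorname{prox}_{\lambda\mathcal{R}}(x) = \arg\min_w \lambda\mathcal{R}(w) + \tfrac{1}{2}\|w - x\|^2$. It is not the parallel algorithm proposed in the paper, and the function names are hypothetical. The sketch assumes two facts: the proximal mapping preserves the ordering of the input, and $\mathcal{R}$ is linear on the set of sorted vectors. Under those assumptions the problem reduces to sorting, shifting each entry by a fixed amount, and projecting onto nondecreasing sequences (isotonic regression via pool-adjacent-violators). It assumes $d \ge 2$.

```python
import numpy as np

def isotonic_nondecreasing(y):
    """Least-squares projection of y onto nondecreasing sequences (PAVA)."""
    sums, counts = [], []
    for value in y:
        sums.append(float(value))
        counts.append(1)
        # Merge neighbouring blocks while their means violate the ordering.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def prox_weight_sharing(x, lam):
    """Sketch of prox of lam * R at x: sort, shift, project, unsort."""
    x = np.asarray(x, dtype=float)
    d = x.size
    order = np.argsort(x)
    # Gradient of R restricted to ascending vectors: (2k - d + 1) / (d - 1).
    coeffs = (2.0 * np.arange(d) - (d - 1)) / (d - 1)
    shifted = x[order] - lam * coeffs
    projected = isotonic_nondecreasing(shifted)
    out = np.empty(d)
    out[order] = projected  # undo the sort
    return out
```

For example, with $x = (0, 3)$ and $\lambda = 1$ the two weights move toward each other by one unit each, giving $(1, 2)$; with $\lambda = 2$ they merge at their mean, $(1.5, 1.5)$. The pooling step is where weights that meet become shared, and because merged blocks take their average, the mean of the weights is unchanged.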