Weight-Sharing Regularization

Mehran Shakerinava; Motahareh Sohrabi; Siamak Ravanbakhsh; Simon Lacoste-Julien

TL;DR

The paper proposes a weight-sharing regularization penalty $\mathcal{R}$, studies its proximal mapping, and interprets that mapping as a physical system of interacting particles. In experiments, the regularizer enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, a setting in which convolutional neural networks fail.

Abstract

Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, while convolutional neural networks fail in this setting. Our code is available on GitHub.
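
As a rough illustration of the penalty defined in the abstract, the sketch below evaluates $\mathcal{R}(w)$ in two ways: directly over all pairs, and via sorting. The second form uses the fact that, once the weights are sorted, the pairwise sum is a linear function of them. The block also includes a small sequential reference for $\operatorname{prox}_{\lambda\mathcal{R}}$ based on sorting followed by isotonic regression (pool adjacent violators). This is a standard substitute technique chosen for the example and is not one of the paper's collision-based parallel algorithms. All function names and the parameter `lam` are hypothetical, introduced here for illustration only.

```python
from itertools import combinations


def weight_sharing_penalty_naive(w):
    """R(w) = 1/(d-1) * sum over pairs |w_i - w_j|, computed in O(d^2)."""
    d = len(w)
    return sum(abs(a - b) for a, b in combinations(w, 2)) / (d - 1)


def weight_sharing_penalty_sorted(w):
    """Same value in O(d log d): after sorting ascending, the k-th weight
    (0-indexed) has k weights below it and d-1-k above it, so it enters
    the pairwise sum with coefficient 2k - d + 1."""
    d = len(w)
    s = sorted(w)
    return sum((2 * k - d + 1) * x for k, x in enumerate(s)) / (d - 1)


def _isotonic_nondecreasing(v):
    """Pool-adjacent-violators: least-squares non-decreasing fit to v."""
    blocks = []  # each block is [sum, count]
    for x in v:
        blocks.append([x, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def prox_penalty_reference(w, lam):
    """Reference for argmin_u lam * R(u) + 0.5 * ||u - w||^2.

    Assumption (not taken from the source): on sorted weights the penalty
    is linear, so the prox reduces to an isotonic regression of the
    shifted sorted weights, mapped back to the original order."""
    d = len(w)
    order = sorted(range(d), key=lambda i: w[i])
    shifted = [
        w[i] - lam * (2 * k - d + 1) / (d - 1) for k, i in enumerate(order)
    ]
    fitted = _isotonic_nondecreasing(shifted)
    out = [0.0] * d
    for k, i in enumerate(order):
        out[i] = fitted[k]
    return out
```

For example, with `w = [1.0, 3.0, 6.0]` both penalty functions give `(2 + 5 + 3) / 2 = 5.0`. For two weights, `prox_penalty_reference([0.0, 1.0], 0.25)` moves each weight toward the other by `0.25`, and a large enough `lam` merges them at their mean, which matches the "weights become shared" behaviour the abstract describes.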

Paper Structure (36 sections, 9 theorems, 46 equations, 9 figures, 4 tables, 5 algorithms)

INTRODUCTION
Contributions
Related Works
Outline
SUBDIFFERENTIAL OF $\mathcal{R}$
PROXIMAL MAPPING OF $\mathcal{R}$
ALGORITHMS FOR $\operatorname{prox}_\mathcal{R}$
Imminent Collisions Algorithm
End Collisions Algorithm
Search Collisions Algorithm
REWINDING
EXPERIMENTS
MNIST on a Torus
CIFAR10
FUTURE WORK
...and 21 more sections

Key Result

lemma 2

The ODE preserves weight-sharing. Formally, for all $t$,

Figures (9)

Figure 1: Depiction of the unregularized error function's contour lines (in blue), together with the constraint area for $\mathcal{R} + \ell_1$, where the optimal parameter vector $w$ is marked by $w^\star$. Notice that the weights are set to be equal (shared) in the solution.
Figure 2: Sample digits from MNIST on a torus.
Figure 3: The emergence of learned convolution-like filters in a fully connected network with weight-sharing regularization on CIFAR10.
Figure 4: A random sample of learned filters from the 256 filters with the highest number of non-zero weights.
Figure 5: Test accuracy vs. dataset ratio on MNIST on torus for various models.
...and 4 more figures

Theorems & Definitions (22)

example 1
lemma 2: Weight-sharing
lemma 3: Monotonic inclusion
theorem 4: Proximal mapping of $\mathcal{R}$
proposition 5: Conservation of Momentum
proposition 6: Rightmost collision
theorem 7
definition 8: Subgradient
definition 9: Subdifferential
proposition 10: Subdifferential of $\max$
...and 12 more
