Towards a More Complete Theory of Function Preserving Transforms

Michael Painter

TL;DR

A derivation for R2R is provided and it is shown that it yields competitive performance with other function preserving transforms, thereby decreasing the restrictions on deep learning architectures that can be extended through function preserving transforms.

Abstract

In this paper, we develop novel techniques that can be used to alter the architecture of a neural network while maintaining the function it represents. Such operations are known as function preserving transforms and have proven useful for transferring knowledge between networks in order to evaluate architectures quickly, which makes them applicable to efficient architecture search. Our methods allow the integration of residual connections into function preserving transforms, so we call them R2R. We provide a derivation for R2R and show that it yields competitive performance with other function preserving transforms, thereby decreasing the restrictions on deep learning architectures that can be extended through function preserving transforms. We perform a comparative analysis with other function preserving transforms such as Net2Net and Network Morphisms, where we shed light on their differences and individual use cases. Finally, we show the effectiveness of R2R in training models quickly, as well as its ability to learn a more diverse set of filters on image classification tasks compared to Net2Net and Network Morphisms.

Paper Structure

This paper contains 30 sections, 33 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An overview of R2WiderR. The purple volumes represent ${x}^{i}_{L}$ and ${x}^{i}_{R}$ in the widened volume ${\bar{x}}^{i}$. The convolution operations are labeled with their initializations, which shows that ${x}^{i}_{L}$ and ${x}^{i}_{R}$ cancel each other out in ${x}^{i+1}$.
  • Figure 2: A schematic of how to adapt simple residual connections for the R2WiderR operation, in the simple case. The purple volumes indicate the new parameters in ${\bar{x}}^{\ell}$, and $g$ represents the composition of layers $\ell+1$ to $i-1$ in the network.
  • Figure 3: A schematic of the R2DeeperR operation. Before R2DeeperR is applied, ${x}^{o_1}$ and ${x}^{o_2}$ can be ignored, and the network would compute ${x}^{i+1}$ directly from ${x}^{i}$. After R2DeeperR is applied, ${U}^{o_1}$ is used to make two identical volumes, which cancel each other out in volume ${x}^{o_2}$, so that ${x}^{o_2}=0$. The residual connection then ensures that ${x}^{e}={x}^{i}$.
  • Figure 4: Visualization of weights from a $7\times 7$ convolution layer trained on Cifar-10. From left to right we have R2R, Net2Net and NetMorph schemes. On the top row we have visualizations immediately after the FPT operation, and the bottom row shows the filters after they have all been trained to convergence.
  • Figure 5: Validation curves comparing a student network using each of Net2WiderNet, R2WiderR, NetMorph, with baselines of random padding and training the ResNet "from scratch".
  • ...and 5 more figures
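
The cancellation idea described in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's implementation: it assumes fully connected layers instead of convolutions, a ReLU nonlinearity, and random initialization of the new weights, and the names `forward` and `widen_with_cancellation` are hypothetical. The new hidden units are added as two identical halves (playing the roles of ${x}^{i}_{L}$ and ${x}^{i}_{R}$) whose outgoing weights have opposite signs, so their contributions to the next layer cancel.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def forward(x, w1, b1, w2, b2):
    """Two-layer network: hidden ReLU layer followed by a linear layer."""
    return w2 @ relu(w1 @ x + b1) + b2


def widen_with_cancellation(w1, b1, w2, extra, rng):
    """Add 2 * extra hidden units without changing the network's output.

    Assumption for illustration: dense layers and random new weights.
    The two new halves share incoming weights (v, c), so they compute
    identical activations; their outgoing weights are u and -u, so their
    contributions to the next layer sum to zero.
    """
    n_in = w1.shape[1]
    n_out = w2.shape[0]
    v = rng.normal(size=(extra, n_in))   # incoming weights of the new units
    c = rng.normal(size=extra)           # biases of the new units
    u = rng.normal(size=(n_out, extra))  # outgoing weights of the left half
    w1_wide = np.vstack([w1, v, v])
    b1_wide = np.concatenate([b1, c, c])
    w2_wide = np.hstack([w2, u, -u])     # right half uses the negated weights
    return w1_wide, b1_wide, w2_wide


rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

w1_wide, b1_wide, w2_wide = widen_with_cancellation(w1, b1, w2, extra=5, rng=rng)
y_before = forward(x, w1, b1, w2, b2)
y_after = forward(x, w1_wide, b1_wide, w2_wide, b2)
print(np.allclose(y_before, y_after))
```

In this sketch the widened hidden layer has 14 units instead of 4, yet the output matches the original network up to floating-point error. Because the new outgoing weights `u` and `-u` are nonzero, the new units receive nonzero gradients once training resumes.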