Make Deep Networks Shallow Again

Bernhard Bermeitinger, Tomas Hrycej, Siegfried Handschuh

TL;DR

The paper investigates whether very deep residual networks are inherently superior to wide, shallow architectures by treating the residual stack as a Taylor-like expansion. It proposes truncating the expansion to the first two terms, yielding a parallel shallow layer that aggregates the original modules, and tests this hypothesis on MNIST and CIFAR-10 across extensive architectural configurations. Empirically, deep sequential and shallow parallel networks achieve similar training and validation losses when parameter counts are matched, with shallow variants sometimes generalizing better on the validation set. These findings suggest that architecture simplification may preserve performance and motivate further study, potentially aided by more robust optimization methods for different depths.

Abstract

Deep neural networks have a strong track record and are therefore often viewed as the best architectural choice for complex applications. For a long time, their main shortcoming was the vanishing gradient, which prevented numerical optimization algorithms from converging acceptably. A breakthrough came with the concept of residual connections: an identity mapping parallel to a conventional layer. This concept is applicable to stacks of layers of the same dimension and substantially alleviates the vanishing gradient problem. A stack of residual connection layers can be expressed as an expansion of terms similar to the Taylor expansion. This expansion suggests the possibility of truncating the higher-order terms and obtaining an architecture consisting of a single broad layer composed of all initially stacked layers in parallel. In other words, a sequential deep architecture is replaced by a parallel shallow one. Prompted by this theory, we investigated the performance of the parallel architecture in comparison to the sequential one. Both architectures were trained on the computer vision datasets MNIST and CIFAR10 for a total of 6912 combinations of varying numbers of convolutional layers, numbers of filters, kernel sizes, and other meta-parameters. Our findings demonstrate a surprising equivalence between the deep (sequential) and shallow (parallel) architectures. Both layouts produced similar results in terms of training and validation set loss. This discovery implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such a substitution has the potential to simplify network architectures, improve optimization efficiency, and accelerate the training process.
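
The truncation idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the dense `tanh` branches, the `scale` factor, and all function names (`make_module`, `sequential`, `parallel`, `max_gap`) are assumptions made for illustration, whereas the paper works with convolutional layers. The sketch compares a residual stack, x_{k+1} = x_k + F_k(x_k), with its parallel counterpart, y = x_0 + sum_k F_k(x_0), in which every branch sees the same input.

```python
import math
import random


def make_module(dim, scale, rng):
    """Toy residual branch F(x) = scale * tanh(W x) with a random matrix W.

    Hypothetical stand-in for a convolutional layer; `scale` controls how
    far each branch moves the signal away from the identity mapping.
    """
    W = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def F(x):
        return [scale * math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in W]

    return F


def sequential(x, modules):
    """Deep residual stack: each branch sees the output of the previous step."""
    for F in modules:
        fx = F(x)
        x = [a + b for a, b in zip(x, fx)]
    return x


def parallel(x, modules):
    """Shallow counterpart: every branch sees the original input x."""
    out = list(x)
    for F in modules:
        out = [a + b for a, b in zip(out, F(x))]
    return out


def max_gap(scale, dim=4, depth=4, seed=0):
    """Largest absolute difference between the two outputs for one random input."""
    rng = random.Random(seed)
    modules = [make_module(dim, scale, rng) for _ in range(depth)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    seq = sequential(x, modules)
    par = parallel(x, modules)
    return max(abs(a - b) for a, b in zip(seq, par))


if __name__ == "__main__":
    for scale in (0.0, 0.01, 0.1, 1.0):
        print(f"scale={scale}: max gap = {max_gap(scale):.6f}")
```

For a single branch the two forms coincide exactly. For several branches the gap in this toy setting is bounded by a term proportional to `scale` squared, which is the sense in which the dropped higher-order terms are small when each branch stays close to the identity. Whether trained networks actually operate in that regime is the empirical question the paper addresses.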

Paper Structure (9 sections, 11 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of the sequential architecture with four consecutive convolutional layers with eight filters each and their skip connections.
  • Figure 2: Overview of the parallelized architecture of Figure 1 with four convolutional layers with eight filters each and one skip connection.
  • Figure 3: Sequential vs. parallel architecture: loss dependence on the number of residual convolutional layers (with a single filter per layer) for the two datasets MNIST (left) and CIFAR10 (right).
  • Figure 4: Sequential vs. parallel architecture: loss dependence on the number of filters (with 16 convolutional layers) for the two datasets MNIST (left) and CIFAR10 (right).
  • Figure 5: Sequential vs. parallel architecture: loss dependence on the ratio of the numbers of layers and filters (product of the number of layers and the number of filters is fixed at 32) for the two datasets MNIST (left) and CIFAR10 (right).