ResidualDroppath: Enhancing Feature Reuse over Residual Connections

Sejik Park

TL;DR

This work identifies and analyzes the limitations of feature reuse with vanilla residual connections, and gives the model an additional opportunity to learn feature reuse through two types of training iterations.

Abstract

Residual connections are one of the most important components in neural network architectures: they mitigate the vanishing gradient problem and make much deeper networks trainable. One possible explanation for how residual connections aid deeper network training is that they promote feature reuse. However, we identify and analyze the limitations of feature reuse with vanilla residual connections. To address these limitations, we propose modifications to the training procedure. Specifically, we give the model an additional opportunity to learn feature reuse with residual connections through two types of iterations during training. The first type applies droppath, which enforces feature reuse by randomly dropping a subset of layers. The second type trains the dropped parts of the model while freezing the undropped parts. As a result, the dropped parts learn in a way that encourages feature reuse, because the model relies on undropped parts that were themselves trained with feature reuse in mind. Overall, we demonstrate performance improvements for image classification in certain cases with models that use residual connections.
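The two iteration types can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the names `sample_drop_mask`, `residual_forward`, and `two_stage_plan` are hypothetical, blocks are scalar functions for brevity, no survival-probability rescaling is applied, and the sketch assumes that the second iteration runs the full network in the forward pass, which the abstract does not state explicitly.

```python
import random


def sample_drop_mask(num_blocks, drop_prob, rng):
    """Return one bool per residual block; True means the block is dropped."""
    return [rng.random() < drop_prob for _ in range(num_blocks)]


def residual_forward(x, blocks, dropped):
    """Apply x <- x + f(x) for kept blocks; a dropped block acts as the identity."""
    for f, is_dropped in zip(blocks, dropped):
        if not is_dropped:
            x = x + f(x)
    return x


def two_stage_plan(dropped):
    """Describe which blocks run and which are updated in each iteration type."""
    n = len(dropped)
    stage1 = {
        # Droppath iteration: dropped blocks are skipped, so only kept blocks
        # take part in the forward pass and receive updates.
        "active": [not d for d in dropped],
        "trainable": [not d for d in dropped],
    }
    stage2 = {
        # Assumption: the full network runs in the second iteration.
        "active": [True] * n,
        # Blocks kept in stage 1 are frozen; blocks dropped in stage 1 train.
        "trainable": list(dropped),
    }
    return stage1, stage2


# Toy residual branches f(x) = w * x, standing in for real layers.
blocks = [lambda x, w=w: w * x for w in (0.1, 0.2, 0.3, 0.4)]

rng = random.Random(0)
dropped = sample_drop_mask(len(blocks), drop_prob=0.5, rng=rng)
stage1, stage2 = two_stage_plan(dropped)

y_stage1 = residual_forward(1.0, blocks, dropped)             # droppath forward
y_full = residual_forward(1.0, blocks, [False] * len(blocks))  # inference-style forward
```

In a real framework, the `trainable` flags would map to whatever mechanism freezes parameters, and the same drop mask would be shared by the two consecutive iterations.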

Paper Structure

This paper contains 15 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Feature Reuse Across Multiple Layers. It visualizes the feature distribution of a model built from linear layers, with depth 32 and hidden dimension 32, and a residual connection at each layer. Despite the residual connections, the model arrives at similar feature distributions by repeatedly transforming the previous layer's feature distribution. This could be disadvantageous for information retention.
  • Figure 2: Toy Dataset. It visualizes 400 sampled points from the toy dataset based on the spiral function. Blue and red indicate the class of each point.
  • Figure 3: Feature Similarity over Layers with a Model of Depth 32 and Dimension 32. The similarity heatmap shows that similarity decreases and increases across multiple layers, which indirectly indicates that the model is performing multiple transformations of the previous layer’s feature distribution to obtain similar feature distributions.
  • Figure 4: Feature Visualization during Training with a Model of Depth 6 and Dimension 6. It shows cases where some nodes achieve similar distributions across multiple layers.
  • Figure 5: ResidualDroppath. It visualizes the operation of our proposed algorithm at the block level. There are two iteration stages. In the first stage, as shown in (b), Droppath is performed. In the second stage, paths that were not dropped in the first stage are frozen, while the dropped paths stay trainable. During inference, the network follows the conventional flow illustrated in (a).
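For readers who want a toy dataset similar to the one in Figure 2, the sketch below generates 400 two-class points along two interleaved spiral arms. The captions do not give the exact spiral function, so the parameterization here (`turns`, `noise`, linear radius, second arm rotated by 180 degrees) and the name `make_two_spirals` are assumptions chosen for illustration only.

```python
import math
import random


def make_two_spirals(num_points=400, turns=2.0, noise=0.05, seed=0):
    """Sample 2-D points from two interleaved spiral arms with labels 0 and 1."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(num_points):
        label = i % 2                      # alternate classes for a balanced set
        t = rng.random()                   # position along the arm, in [0, 1)
        # The second arm is the first one rotated by 180 degrees.
        angle = 2.0 * math.pi * turns * t + math.pi * label
        radius = t                         # radius grows linearly along the arm
        x = radius * math.cos(angle) + rng.gauss(0.0, noise)
        y = radius * math.sin(angle) + rng.gauss(0.0, noise)
        points.append((x, y))
        labels.append(label)
    return points, labels


points, labels = make_two_spirals()
```

Plotting `points` colored by `labels` (for example, blue for class 0 and red for class 1) gives a picture in the spirit of Figure 2, though not necessarily the same shape.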