Neural Network Training Techniques Regularize Optimization Trajectory: An Empirical Study

Cheng Chen, Junjie Yang, Yi Zhou

TL;DR

The paper addresses why common DNN training techniques accelerate optimization by introducing a regularity principle for stochastic nonconvex optimization, parameterized by $\gamma$, which enforces alignment between updates and the optimization trajectory. It proves a sublinear convergence bound of $\mathcal{O}(1/(\gamma T))$ and demonstrates, through extensive experiments on AlexNet, VGG, ResNet, and U-Net, that training techniques raise $\gamma$ and yield faster convergence. The study systematically analyzes activation functions, batch normalization, skip-connections, and optimizers, showing that each technique contributes to trajectory regularization and faster progress according to the principle. This provides a quantitative metric for the regularization effect of training practices and offers guidance for designing training strategies with improved convergence behavior in deep learning.

Abstract

Modern deep neural network (DNN) training uses various techniques, e.g., nonlinear activation functions, batch normalization, and skip-connections. Despite their effectiveness, it remains unclear how they help accelerate DNN training in practice. In this paper, we provide an empirical study of the regularization effect of these training techniques on DNN optimization. Specifically, we find that the optimization trajectories of successful DNN training runs consistently obey a certain regularity principle that regularizes the model update direction to be aligned with the trajectory direction. Theoretically, we show that such a regularity principle leads to a convergence guarantee in nonconvex optimization, with a convergence rate that depends on a regularization parameter. Empirically, we find that DNN training runs that apply the training techniques converge fast and obey the regularity principle with a large regularization parameter, implying that the model updates are well aligned with the trajectory. In contrast, DNN training runs without the training techniques converge slowly and obey the regularity principle with a small regularization parameter, implying that the model updates are not well aligned with the trajectory. Therefore, different training techniques regularize the model update direction via the regularity principle to facilitate convergence.
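
The idea of an update being "aligned with the trajectory direction" can be made concrete with a small sketch. The code below is not the paper's regularity principle and does not compute its parameter $\gamma$. It is an illustrative proxy, assuming that alignment is measured as the cosine similarity between each update $\theta_{k+1} - \theta_k$ and the direction from $\theta_k$ to the final iterate. The function name `alignment_scores` and the toy trajectories are hypothetical.

```python
import numpy as np


def alignment_scores(trajectory):
    """Cosine similarity between each update and the direction to the final iterate.

    Illustrative proxy only (an assumption, not the paper's Definition 2):
    score_k = cos(theta_{k+1} - theta_k, theta_T - theta_k).
    """
    traj = np.asarray(trajectory, dtype=float)
    final = traj[-1]
    scores = []
    for k in range(len(traj) - 1):
        update = traj[k + 1] - traj[k]   # model update at step k
        to_end = final - traj[k]         # trajectory direction from step k
        denom = np.linalg.norm(update) * np.linalg.norm(to_end)
        # A zero-length vector has no direction; report 0 alignment for it.
        scores.append(float(update @ to_end / denom) if denom > 0 else 0.0)
    return np.array(scores)


# Hypothetical 2-D trajectories, chosen by hand for illustration.
straight = [[0, 0], [1, 0], [2, 0], [3, 0]]    # every update points at the endpoint
zigzag = [[0, 0], [1, 1], [2, -1], [3, 0]]     # updates wander off the direct path
backward = [[0, 0], [-1, 0], [2, 0]]           # first update moves away from the endpoint

for name, traj in [("straight", straight), ("zigzag", zigzag), ("backward", backward)]:
    print(name, np.round(alignment_scores(traj), 3))
```

In this proxy, a trajectory whose updates consistently point toward where training ends up scores close to 1 at every step, while detours and reversals lower the score. This loosely mirrors the abstract's contrast between well-aligned and poorly aligned updates.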

Paper Structure

This paper contains 18 sections, 1 theorem, 9 equations, 16 figures, 1 table.

Key Result

Theorem 1

Apply SA to solve the over-parameterized problem (P) and generate an optimization trajectory $\{\theta_0, \theta_1, ..., \theta_T \}$. If the optimization trajectory satisfies the regularity principle with parameter $\gamma>0$, then after $T=nB, B\in\mathbb{N}$ iterations (i.e., $B$ epochs), the ave

Figures (16)

  • Figure 1: Illustration of optimization trajectories. The left trajectory satisfies regularity principle with large $\gamma$ and the right trajectory satisfies regularity principle with small $\gamma$.
  • Figure 2: Training ResNet-18 with different activation functions.
  • Figure 3: Training U-Net with different activation functions.
  • Figure 4: Training VGGs with and without batch normalization on CIFAR-10.
  • Figure 5: Training VGGs with and without batch normalization on CIFAR-100.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Definition 1: Optimization trajectory
  • Definition 2: Regularity principle for SA
  • Theorem 1: Convergence under Regularity Principle