SGD with Partial Hessian for Deep Neural Networks Optimization

Ying Sun, Hongwei Yong, Lei Zhang

TL;DR

SGD-PH addresses the practical difficulty of using second-order information in deep neural network optimization. It exploits the diagonal Hessian structure of channel-wise 1D parameters (e.g., the gamma/beta of batch normalization (BN) and the gamma of weight normalization (WN)) and applies precise Newton-type updates to these parameters, while updating all other parameters with standard first-order SGD. The method uses Hessian-free backpropagation to extract the diagonal partial Hessian, together with rectification and momentum techniques to handle nonconvexity and stochasticity. Empirical results on CIFAR-10/100, Mini-ImageNet, and ImageNet show that SGD-PH often outperforms both first-order and some second-order optimizers, improves generalization on deeper models, and remains effective in networks without BN by relying on weight normalization. However, the approach incurs higher time and memory costs than pure SGD, which is a trade-off between optimization quality and computational resources. Overall, SGD-PH demonstrates that selective, precise second-order information can improve optimization in deep learning without sacrificing generalization, offering a practical way to incorporate second-order information in large-scale networks.

Abstract

Due to the effectiveness of second-order algorithms in solving classical optimization problems, designing second-order optimizers to train deep neural networks (DNNs) has attracted much research interest in recent years. However, because of the very high dimension of intermediate features in DNNs, it is difficult to directly compute and store the Hessian matrix for network optimization. Most of the previous second-order methods approximate the Hessian information imprecisely, resulting in unstable performance. In this work, we propose a compound optimizer, which is a combination of a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters. We show that the associated Hessian matrices of channel-wise parameters are diagonal and can be extracted directly and precisely from Hessian-free methods. The proposed method, namely SGD with Partial Hessian (SGD-PH), inherits the advantages of both first-order and second-order optimizers. Compared with first-order optimizers, it adopts a certain amount of information from the Hessian matrix to assist optimization, while compared with the existing second-order optimizers, it keeps the good generalization performance of first-order optimizers. Experiments on image classification tasks demonstrate the effectiveness of our proposed optimizer SGD-PH. The code is publicly available at https://github.com/myingysun/SGDPH.
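
The compound update described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `sgd_ph_step`, the parameter names, the use of an absolute value with a small floor `eps` as the rectification, and the omission of momentum are all assumptions made for illustration. Channel-wise parameters receive a Newton-type step scaled by the inverse of their diagonal Hessian entries, and every other parameter receives a plain SGD step.

```python
def sgd_ph_step(params, grads, hess_diag, channel_wise, lr=0.5, eps=1e-8):
    """One illustrative compound update (hypothetical helper, not the paper's code).

    params, grads: dict mapping a parameter name to a list of floats.
    hess_diag:     dict mapping each channel-wise name to its diagonal Hessian.
    channel_wise:  set of names updated with the Newton-type step.
    """
    updated = {}
    for name, values in params.items():
        g = grads[name]
        if name in channel_wise:
            # Assumed rectification: use |h| with a floor so the step stays
            # a descent direction even where the curvature is negative.
            h = [max(abs(x), eps) for x in hess_diag[name]]
            updated[name] = [p - lr * gi / hi for p, gi, hi in zip(values, g, h)]
        else:
            # All remaining parameters: standard first-order SGD step.
            updated[name] = [p - lr * gi for p, gi in zip(values, g)]
    return updated


# Toy values, chosen only to exercise both branches.
params = {"conv.weight": [1.0, -2.0], "bn.gamma": [1.0, 1.0]}
grads = {"conv.weight": [0.5, -0.5], "bn.gamma": [0.4, -0.2]}
hess_diag = {"bn.gamma": [2.0, -4.0]}

new_params = sgd_ph_step(params, grads, hess_diag, channel_wise={"bn.gamma"})
print(new_params)
```

In this toy run the second `bn.gamma` entry has negative curvature, so the rectified step still moves against the gradient instead of following the raw Newton direction.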

Paper Structure (23 sections, 8 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of SGD-PH. (a) The case where BN is adopted in neural network training; other normalization methods such as LN, GN and IN can be represented in the same way. (b) The case of decoupling the convolutional layers when no normalization layers follow them, where $\boldsymbol{\beta}$ represents the bias; see Section \ref{sec_general_conv} for more details.
  • Figure 2: Illustration of diagonal Hessian computation. The light green boxes represent elements that can be any real numbers, while the white boxes represent zeros. The central $3\times 3$ matrix (i.e., ${\bf{H}}_{SO}$) in ${\bf{H}}$ is diagonal and corresponds to the specific 1D variable. By multiplying ${\bf{H}}$ with the vector ${\bf{e}}_{SO}$, we can extract the diagonal elements precisely from the middle of ${\bf{H}}{\bf{e}}_{SO}$ and then compute the element-wise inverse to get ${\bf{D}}_{SO}$, which is exactly the diagonal of ${\bf{H}}^{-1}_{SO}$.
  • Figure 3: Testing accuracy curves of different optimizers for different DNNs on the CIFAR-100 and CIFAR-10 datasets.
  • Figure 4: Testing accuracy curves of different optimizers on Mini-ImageNet dataset.
  • Figure 5: Testing and training accuracy curves of VGG19 without BN layers.
  • ...and 2 more figures
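
The extraction described in the Figure 2 caption can be checked on a small hand-built example. The sketch below is an illustration under stated assumptions, not the paper's code: the $7\times 7$ matrix `H`, its entries, and the helper `matvec` are made up, and the explicit matrix stands in for what a Hessian-free method would provide as a Hessian-vector product without ever forming ${\bf{H}}$. The central $3\times 3$ block is diagonal, so multiplying by the indicator vector ${\bf{e}}_{SO}$ returns exactly its diagonal in the middle three entries.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product; a stand-in for a Hessian-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical symmetric Hessian over 7 parameters. Indices 2..4 belong to one
# channel-wise 1D variable, so the central 3x3 block (H_SO) is diagonal.
H = [
    [4.0, 1.0, 0.5, 0.2, 0.1, 0.3, 0.0],
    [1.0, 3.0, 0.4, 0.6, 0.2, 0.0, 0.1],
    [0.5, 0.4, 2.0, 0.0, 0.0, 0.7, 0.2],
    [0.2, 0.6, 0.0, 5.0, 0.0, 0.1, 0.3],
    [0.1, 0.2, 0.0, 0.0, 0.5, 0.4, 0.6],
    [0.3, 0.0, 0.7, 0.1, 0.4, 6.0, 1.0],
    [0.0, 0.1, 0.2, 0.3, 0.6, 1.0, 2.5],
]

so = slice(2, 5)                                 # positions of the 1D variable
e_so = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]       # ones on those positions only

He = matvec(H, e_so)                             # one Hessian-vector product
diag_so = He[so]                                 # middle entries = diag(H_SO)
d_so = [1.0 / h for h in diag_so]                # element-wise inverse, D_SO

print(diag_so)  # [2.0, 5.0, 0.5]
print(d_so)     # [0.5, 0.2, 2.0]
```

The entries of `He` outside the middle block mix in the off-diagonal (light green) elements and are simply discarded; only the middle slice is used.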