SGD with Partial Hessian for Deep Neural Networks Optimization
Ying Sun, Hongwei Yong, Lei Zhang
TL;DR
SGD-PH addresses the practical challenge of using second-order information in deep neural network optimization. It exploits the diagonal Hessian structure of channel-wise 1D parameters (e.g., BN gamma/beta and WN gamma) to apply precise Newton-type updates to those parameters, while updating all other parameters with standard first-order SGD. The diagonal partial Hessian is extracted with Hessian-free backpropagation, and rectification and momentum techniques handle nonconvexity and stochasticity. Empirical results on CIFAR-10/100, Mini-ImageNet, and ImageNet show that SGD-PH often outperforms both first-order and some second-order optimizers, improves generalization on deeper models, and remains effective in networks without BN by leveraging weight normalization. However, the approach incurs higher time and memory costs than pure SGD, a trade-off between optimization quality and computational resources. Overall, SGD-PH demonstrates that selective, precise second-order information can improve optimization in deep learning without sacrificing generalization, offering a practical way to incorporate second-order information in large-scale networks.
Abstract
Due to the effectiveness of second-order algorithms in solving classical optimization problems, designing second-order optimizers to train deep neural networks (DNNs) has attracted much research interest in recent years. However, because of the very high dimension of intermediate features in DNNs, it is difficult to directly compute and store the Hessian matrix for network optimization. Most previous second-order methods approximate the Hessian information imprecisely, resulting in unstable performance. In this work, we propose a compound optimizer that combines a second-order optimizer, which uses a precise partial Hessian matrix to update channel-wise parameters, with the first-order stochastic gradient descent (SGD) optimizer, which updates the other parameters. We show that the Hessian matrices associated with channel-wise parameters are diagonal and can be extracted directly and precisely by Hessian-free methods. The proposed method, namely SGD with Partial Hessian (SGD-PH), inherits the advantages of both first-order and second-order optimizers. Compared with first-order optimizers, it uses a certain amount of information from the Hessian matrix to assist optimization; compared with existing second-order optimizers, it keeps the good generalization performance of first-order optimizers. Experiments on image classification tasks demonstrate the effectiveness of SGD-PH. The code is publicly available at https://github.com/myingysun/SGDPH.
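To illustrate why a diagonal Hessian can be read off exactly from a Hessian-free method: if H is diagonal, the Hessian-vector product H·1 with the all-ones vector equals diag(H), so one extra gradient-level pass is enough. The following is a minimal sketch in plain Python on a separable toy loss, not the authors' implementation. The function names, the finite-difference Hessian-vector product (standing in for the backpropagation-based one), and the rectification rule max(|h|, delta) are all assumptions made for illustration.

```python
def grad(theta, a, b):
    """Gradient of the separable toy loss sum_c (0.5*a_c*theta_c**2 - b_c*theta_c)."""
    return [ac * t - bc for t, ac, bc in zip(theta, a, b)]


def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Central-difference estimate of H @ v that uses only gradient evaluations."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def rectified_newton_step(theta, g, h_diag, lr=1.0, delta=1e-3):
    """Newton-type step; |h| floored at delta is an assumed rectification rule."""
    return [t - lr * gi / max(abs(hi), delta)
            for t, gi, hi in zip(theta, g, h_diag)]


# Hypothetical channel-wise parameters; the last channel has negative curvature.
a = [2.0, 0.5, -1.0]
b = [1.0, 1.0, 0.0]
theta = [0.0, 0.0, 0.5]


def grad_fn(th):
    return grad(th, a, b)


g = grad_fn(theta)
# With a diagonal Hessian, H @ ones is exactly the diagonal.
h_diag = hessian_vector_product(grad_fn, theta, [1.0] * len(theta))
new_theta = rectified_newton_step(theta, g, h_diag)
print(h_diag, new_theta)
```

On the two convex channels a single step lands on the minimizer b_c / a_c. On the negative-curvature channel, taking the absolute value keeps the step a descent direction instead of moving toward the local maximum, which is the kind of problem rectification is meant to handle. In a real network the remaining parameters would still be updated by SGD.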
