A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training

Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng

TL;DR

TIAM (Triple-Inertial Accelerated Alternating Minimization) is a framework for training multilayer perceptrons that targets the slow convergence and limited theoretical guarantees of SGD and AM methods. It pairs layer-wise triple-inertial acceleration with a specialized backtracking-based approximation that avoids costly matrix inversions. The method has a global convergence guarantee and a linear convergence rate under mild assumptions, and it shows better generalization and computational efficiency than the compared methods on four datasets with a network of $L=3$ layers and $n_h=100$ hidden units per layer. Performance is robust across ReLU variants, and the experiments support the practical benefit of layer-wise triple inertia for faster training. The authors also present TIAM as a flexible framework that could be extended to stochastic mini-batch training in future work.
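The idea of replacing a matrix inversion with a backtracking step can be illustrated on a toy least-squares subproblem. The sketch below is not the paper's update rule: it is a generic backtracking scheme, and the names `backtracking_step`, `tau`, and `growth`, as well as the small matrix `A`, are assumptions made for illustration. It takes a gradient step of size `1/tau` and raises `tau` until a quadratic upper-bound test passes, so `A^T A` is never inverted.

```python
def residual(A, y, w):
    # r = A w - y for a small dense matrix stored as a list of rows.
    return [sum(a * wj for a, wj in zip(row, w)) - yi for row, yi in zip(A, y)]


def loss(A, y, w):
    # f(w) = 0.5 * ||A w - y||^2
    return 0.5 * sum(r * r for r in residual(A, y, w))


def grad(A, y, w):
    # grad f(w) = A^T (A w - y)
    r = residual(A, y, w)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(w))]


def backtracking_step(A, y, w, tau=1.0, growth=2.0):
    """One gradient step with step size 1/tau; tau grows until the test passes.

    tau and growth are hypothetical parameter names, not taken from the paper.
    """
    g = grad(A, y, w)
    f_w = loss(A, y, w)
    g_sq = sum(gi * gi for gi in g)
    while True:
        w_new = [wi - gi / tau for wi, gi in zip(w, g)]
        # Quadratic model f(w) + <g, d> + (tau/2)||d||^2 evaluated at d = -g/tau.
        model = f_w - g_sq / (2.0 * tau)
        if loss(A, y, w_new) <= model + 1e-12:
            return w_new, tau
        tau *= growth


# Toy data (an assumption for illustration); the exact minimizer is w = (1, 1).
A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 1.0, 2.0]
w, tau = [0.0, 0.0], 1.0
for _ in range(200):
    w, tau = backtracking_step(A, y, w, tau)
```

Because the accepted `tau` is carried over between iterations, the test typically fails only in the first few steps, after which each iteration costs one gradient and one loss evaluation.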

Abstract

The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models. However, it has several limitations, including susceptibility to vanishing gradients, sensitivity to input data, and a lack of robust theoretical guarantees. In recent years, alternating minimization (AM) methods have emerged as a promising alternative for model training by employing gradient-free approaches to iteratively update model parameters. Despite their potential, these methods often exhibit slow convergence rates. To address this challenge, we propose a novel Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for neural network training. The TIAM approach incorporates a triple-inertial acceleration strategy with a specialized approximation method, facilitating targeted acceleration of different terms in each sub-problem optimization. This integration improves the efficiency of convergence, achieving superior performance with fewer iterations. Additionally, we provide a convergence analysis of the TIAM algorithm, including its global convergence properties and convergence rate. Extensive experiments validate the effectiveness of the TIAM method, showing significant improvements in generalization capability and computational efficiency compared to existing approaches, particularly when applied to the rectified linear unit (ReLU) and its variants.
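As a rough illustration of how inertial extrapolation can speed up alternating minimization, the sketch below applies a single inertial step to a two-block toy quadratic. It is not the TIAM algorithm: the paper's scheme uses three inertial steps on network sub-problems, while this example uses one, and the objective, the function name `alternating_minimization`, and the weight `beta` are assumptions chosen so that each block update has a closed form.

```python
def alternating_minimization(beta=0.0, tol=1e-8, max_iter=1000):
    """Minimize f(x, y) = 0.5*(x - 2*y)**2 + 0.5*(y - 1)**2 block by block.

    beta is a hypothetical inertial weight; beta = 0 recovers plain AM.
    The minimizer of this toy objective is (x, y) = (2, 1), which is used
    below only as a stopping test.
    """
    x, y = 0.0, 0.0
    y_prev = y
    for k in range(1, max_iter + 1):
        # Inertial extrapolation: push y along its most recent displacement.
        y_bar = y + beta * (y - y_prev)
        # Exact minimization over x with y fixed at the extrapolated point.
        x = 2.0 * y_bar
        # Exact minimization over y with x fixed: solve 5*y - 2*x - 1 = 0.
        y_prev, y = y, (2.0 * x + 1.0) / 5.0
        if abs(x - 2.0) < tol and abs(y - 1.0) < tol:
            return x, y, k
    return x, y, max_iter


x_plain, y_plain, iters_plain = alternating_minimization(beta=0.0)
x_inert, y_inert, iters_inert = alternating_minimization(beta=0.3)
```

On this toy problem the extrapolated run should reach the tolerance in fewer sweeps than the plain run, which is the qualitative effect the abstract attributes to inertial acceleration. The value `beta=0.3` is an illustrative choice; too large a weight can cause oscillation.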

Paper Structure

This paper contains 26 sections, 8 theorems, 81 equations, 7 figures, 6 tables, and 2 algorithms.

Key Result

Lemma 1

There exist $\beta_l,\delta_l,\gamma_l,\zeta_l>0$ for $k\in\mathbb{N}$ such that it holds that

Figures (7)

  • Figure 1: The figure illustrates the update procedure for the network parameter $b^{k}$ using the triple-inertial acceleration method. $b^{k+1}$ is computed via a conventional gradient descent update within the AM framework. Specifically, $\tilde{b}^{k+1}$ and $\hat{b}^{k+1}$ are derived from the first and second acceleration steps, respectively. Additionally, $\overline{a}^{k}$, $\overline{z}^{k}$, and $\overline{W}^{k+1}$ correspond to the parameter values obtained after the third acceleration step. The proposed framework is highly flexible, allowing seamless incorporation of various inertial acceleration strategies into existing optimization algorithms by effectively leveraging a sufficient amount of historical iterative information.
  • Figure 2: The curve illustrates the mean and standard deviation of test accuracy for all methods over 10 runs. The proposed algorithm consistently outperforms all compared methods on the four datasets.
  • Figure 3: The relationship between test accuracy and runtime for the AM methods on the four datasets.
  • Figure 4: The curve illustrates the mean and standard deviation of the objective function values achieved by the proposed method over 10 runs, utilizing the ReLU activation function and its variants.
  • Figure 5: The curve illustrates the mean and standard deviation of test accuracy for all methods over 10 runs, utilizing the LeakyReLU activation function.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 1
  • Theorem 2
  • Lemma 5
  • Lemma 6