Parareal Neural Networks Emulating a Parallel-in-time Algorithm

Chang-Ock Lee, Youngkyu Lee, Jongho Park

TL;DR

This work introduces parareal neural networks, a parallel-in-time-inspired framework that splits deep networks into parallel subnetworks connected by a coarse corrective network. By emulating the parareal algorithm, the method enables multi-GPU training with reduced inter-GPU communication while preserving or enhancing accuracy. Consistency is shown in a linear setting, and empirical results on VGG-16 and ResNet-1001 across datasets including ImageNet and CIFAR demonstrate competitive performance and favorable training-time characteristics. The approach offers a new avenue for scalable, memory-efficient parallelism in very deep networks by leveraging coarse corrections to propagate residuals across interface layers.

Abstract

As deep neural networks (DNNs) become deeper, their training time increases. For this reason, multi-GPU parallel computing has become a key tool for accelerating the training of DNNs. In this paper, we introduce a novel methodology for constructing, from a given DNN, a parallel neural network that can use multiple GPUs simultaneously. We observe that the layers of a DNN can be interpreted as the time steps of a time-dependent problem and can be parallelized by emulating a parallel-in-time algorithm called parareal. The parareal algorithm consists of fine structures, which can be computed in parallel, and a coarse structure, which provides suitable approximations to the fine structures. By emulating it, we split the layers of the DNN into parallel subnetworks that are connected by a suitable coarse network. We report that the proposed methodology, applied to VGG-16 and ResNet-1001 on several datasets, accelerates training while preserving accuracy.
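
For readers unfamiliar with parareal, the following is a minimal sketch of the classical algorithm on the scalar test problem $y' = \lambda y$, not the paper's neural-network construction. The function names, the choice of explicit Euler for both propagators, and all parameter values are assumptions made for illustration. The fine solves on the sub-intervals are independent of each other (the part that would run in parallel), while the coarse sweep is sequential and corrects the values at the interfaces $T_j$.

```python
def euler(y, lam, dt, steps):
    """Advance y' = lam * y by `steps` explicit Euler steps of size dt."""
    for _ in range(steps):
        y = y + dt * lam * y
    return y


def parareal(y0, lam, T, n_slices, fine_steps, n_iters):
    """Return interface values U[0..n_slices] after n_iters parareal iterations."""
    dT = T / n_slices

    def coarse(y):  # cheap propagator: one Euler step per sub-interval
        return euler(y, lam, dT, 1)

    def fine(y):  # accurate propagator: many small Euler steps per sub-interval
        return euler(y, lam, dT / fine_steps, fine_steps)

    # Initial guess: a single sequential sweep with the coarse propagator.
    U = [y0]
    for _ in range(n_slices):
        U.append(coarse(U[-1]))

    for _ in range(n_iters):
        # These solves do not depend on each other, so they could run in parallel.
        F_old = [fine(u) for u in U[:-1]]
        G_old = [coarse(u) for u in U[:-1]]
        # Sequential coarse sweep with the fine-minus-coarse correction.
        U_new = [y0]
        for j in range(n_slices):
            U_new.append(coarse(U_new[j]) + F_old[j] - G_old[j])
        U = U_new
    return U


def serial_fine(y0, lam, T, n_slices, fine_steps):
    """Reference: apply the fine propagator sequentially over all sub-intervals."""
    dT = T / n_slices
    U = [y0]
    for _ in range(n_slices):
        U.append(euler(U[-1], lam, dT / fine_steps, fine_steps))
    return U


if __name__ == "__main__":
    ref = serial_fine(1.0, -1.0, 2.0, 4, 100)
    for k in range(5):
        U = parareal(1.0, -1.0, 2.0, 4, 100, k)
        print(k, max(abs(a - b) for a, b in zip(U, ref)))
```

With as many iterations as sub-intervals, the parareal iterate agrees with the sequential fine solution up to rounding, which is the sense in which the coarse structure "connects" the independently computed fine pieces.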

Paper Structure

This paper contains 13 sections, 1 theorem, 29 equations, 5 figures, 9 tables, and 2 algorithms.

Key Result

Proposition 3.1

Assume that the original network $f_{\theta}$ is linear and $F_{\eta_j}^j = g_{\phi_{j+1}}^{j+1}$ for $j= 1, \dots, N-1$. Then we have $\bar{f}_{\bar{\theta}} (\mathbf{x}) = f_{\theta} (\mathbf{x})$ for all $\mathbf{x} \in X$.
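
The statement can be checked numerically on a toy example. The sketch below is only one plausible reading of the construction, since this summary does not spell out the network's definition. It assumes that subnetwork $j$ receives a preprocessed input $\mathbf{x}_j$ and outputs $\mathbf{y}_j = g^j(\mathbf{x}_j)$, that the residual at interface $j$ is $\mathbf{y}_j - \mathbf{x}_{j+1}$, and that the coarse maps $F^j$ carry the accumulated residual forward. The helper names, the integer matrices, and the omission of any final classification layer are all hypothetical choices.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def serial_forward(blocks, x):
    """Original linear network: apply g^1, ..., g^N one after another."""
    for A in blocks:
        x = matvec(A, x)
    return x


def parareal_forward(blocks, coarse, preprocess, x):
    """Assumed parareal-style forward pass with N = len(blocks) >= 2 subnetworks.

    blocks[j]     : matrix of subnetwork g^{j+1} (0-based index j)
    coarse[j]     : matrix of coarse map F^{j+1}, acting on the residual at interface j+1
    preprocess[j] : hypothetical map producing the input guess of subnetwork j+1
    """
    n = len(blocks)
    xs = [P(x) for P in preprocess]                    # inputs of the subnetworks
    ys = [matvec(blocks[j], xs[j]) for j in range(n)]  # independent, parallelizable
    r = sub(ys[0], xs[1])                              # residual at the first interface
    for j in range(1, n - 1):
        # New residual plus the coarse propagation of the accumulated one.
        r = add(sub(ys[j], xs[j + 1]), matvec(coarse[j - 1], r))
    return add(ys[n - 1], matvec(coarse[n - 2], r))


if __name__ == "__main__":
    blocks = [[[1, 2], [0, 1]], [[2, 0], [1, 3]], [[1, -1], [4, 2]]]
    coarse = blocks[1:]  # hypothesis of the proposition: F^j = g^{j+1}
    # The first subnetwork sees x itself; the others get deliberately crude guesses.
    preprocess = [lambda v: v, lambda v: [0, 0], lambda v: [v[1] + 5, -v[0]]]
    x = [3, -7]
    print(parareal_forward(blocks, coarse, preprocess, x))
    print(serial_forward(blocks, x))
```

Under these assumptions the two outputs coincide for any choice of the later preprocessing maps: by linearity, the terms involving $\mathbf{x}_{j+1}$ cancel at each interface, leaving only the sequential composition of the blocks.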

Figures (5)

  • Figure 1: Fine and coarse propagations in the parareal algorithm: the red lines that propagate from $T_j$ to $T_{j+1}$ represent \ref{eq_ode2}, and the blue curves that directly connect $T_j$ and $T_{j+1}$ represent \ref{eq_ode3}.
  • Figure 1: A feed-forward neural network and its corresponding parareal neural network: (a) Feed-forward neural network $f_{\theta}$, (b) Parareal neural network $\bar{f}_{\bar{\theta}}$ with $N$ parallel subnetworks ($N=3$).
  • Figure 1: Bottleneck structure of an RU used in ResNet-1001. The first $1 \times 1$ convolution squeezes the number of channels of an input and the last $1 \times 1$ convolution increases the number of channels of an intermediate result. If $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ have different numbers of channels, then a $1 \times 1$ convolution is added to the skip connection in order to match the number of channels.
  • Figure 2: Comparison of the training loss for Parareal ResNet-$N$ ($N=1,3,6,12,18$) on various datasets: (a) CIFAR-10, (b) CIFAR-100, (c) MNIST, and (d) SVHN results.
  • Figure 3: Structure of the $K$-warp operator, $K=2$: (a) Two RUs $f_{\theta_1}^1$ and $f_{\theta_2}^2$ in ResNet. (b) The corresponding $2$-warp operator in WarpNet.

Theorems & Definitions (4)

  • Proposition 3.1: Consistency
  • Proof 1
  • Remark 4.1
  • Remark 4.2