Speeding up and reducing memory usage for scientific machine learning via mixed precision

Joel Hayford; Jacob Goldman-Wetzler; Eric Wang; Lu Lu

TL;DR

This paper explores mixed precision, an approach that combines the float16 and float32 numerical formats to reduce memory usage and increase computational speed, with broad implications for SciML in various computational applications.

Abstract

Scientific machine learning (SciML) has emerged as a versatile approach to address complex computational science and engineering problems. Within this field, physics-informed neural networks (PINNs) and deep operator networks (DeepONets) stand out as the leading techniques for solving partial differential equations by incorporating both physical equations and experimental data. However, training PINNs and DeepONets requires significant computational resources, including long computational times and large amounts of memory. In search of computational efficiency, training neural networks using half precision (float16) rather than the conventional single (float32) or double (float64) precision has gained substantial interest, given the inherent benefits of reduced computational time and memory consumption. However, we find that float16 cannot be applied to SciML methods, because of gradient divergence at the start of training, weight updates going to zero, and the inability to converge to a local minimum. To overcome these limitations, we explore mixed precision, which is an approach that combines the float16 and float32 numerical formats to reduce memory usage and increase computational speed. Our experiments show that mixed precision training not only substantially decreases training times and memory demands but also maintains model accuracy. We also reinforce our empirical observations with a theoretical analysis. This research has broad implications for SciML in various computational applications.
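One of the float16 failure modes named in the abstract, weight updates going to zero, can be illustrated with a few lines of arithmetic. The following is a minimal sketch and not the paper's experiment: it emulates float16 and float32 rounding with Python's standard `struct` module, and the weight, update size, and gradient value are hypothetical numbers chosen only to expose the rounding behavior.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest float16 (half precision) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_f32(x):
    """Round a Python float to the nearest float32 (single precision) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical values: a weight of 1.0 and an update (learning rate times
# gradient) of 1e-4. Just below 1.0, adjacent float16 values are about
# 4.9e-4 apart, so the update is smaller than half that gap.
weight = 1.0
update = 1e-4

new_w16 = to_f16(weight - update)  # rounds back to 1.0: the update is lost
new_w32 = to_f32(weight - update)  # float32 resolves the change

# A hypothetical small gradient: below the smallest positive float16 value
# (about 6e-8), so it underflows to zero in float16.
tiny_grad = 1e-8
grad16 = to_f16(tiny_grad)
grad32 = to_f32(tiny_grad)

print("float16 weight after update:", new_w16)
print("float32 weight after update:", new_w32)
print("float16 gradient:", grad16)
print("float32 gradient:", grad32)
```

In this sketch the float16 weight is unchanged by the update and the float16 gradient is exactly zero, while float32 retains both.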

Paper Structure (38 sections, 2 theorems, 43 equations, 7 figures, 12 tables)

Introduction
Methods in scientific machine learning for PDEs
Physics-informed neural networks
Operator learning with DeepONets
Failure of scientific machine learning with float16
Function regression
Problem setup and total error of float16
Low approximation error for float16
Training difficulty and large optimization error for float16
Loss gradients at network initialization step.
Training loss, loss gradients, and network weights during network training.
Loss landscape during network training.
Two training phases for understanding the failure of float16.
PINNs
...and 23 more sections

Key Result

Theorem 1

The mixed precision loss function will reach some $\theta$ by gradient descent, such that

Figures (7)

Figure 1: Comparison of float32 and float16 for the function regression problem (see the "Function regression" section). A neural network is trained using float32, and its weights and biases are then cast from float32 to float16 after training.
Figure 2: Comparison of training networks with float16, float32, and mixed precision. The networks are initialized with the same weights and biases. (A) The training losses for float16 and float32 networks. (B) $L^2$ norm of the gradients of the training loss with respect to the network's parameters. The loss gradients achieve much smaller values for float32 than float16. (C) The percentage of network weights that remain constant during training. The curves and shaded regions represent the mean and one standard deviation of $10$ runs. Almost all weights become constant for the float16 network. (D) Loss landscapes at different iterations for float32, float16, and mixed precision networks.
Figure 3: Comparison of PINN predictions and loss landscapes at different iterations for the heat equation (see the heat equation section). (A) A PINN is trained using float32, and its weights and biases are then cast from float32 to float16 after training. (B) The local loss landscapes of two networks at different iterations. The change in loss landscape for the float32 network from iteration 0 to iteration 1 is smooth, while the change in loss landscape for the float16 network from iteration 0 to iteration 1 is not smooth and the loss landscape has regions of NaNs.
Figure 4: Flowchart of training networks with mixed precision. The dashed path represents the optional loss scaling technique with a scale factor. Float32 and float16 are abbreviated as f32 and f16, respectively.
Figure 5: Reference solutions and point-wise absolute error of velocity for the Kovasznay flow problem. (First row) Reference solution. (Second row) The network prediction of float16 has large error. (Third and fourth rows) The network predictions of float32 and mixed precision have low error.
...and 2 more figures
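
The mixed precision workflow described in the Figure 4 caption (float16 computation with optional loss scaling) can be sketched on a toy problem. This is a hedged illustration, not the authors' implementation: it assumes the common recipe of keeping a float32 master copy of the weights, computing the gradient in float16, and applying the update in float32. The one-parameter quadratic loss, the learning rate, the scale factor of 1024, and all function names are hypothetical choices, and float16/float32 rounding is emulated with Python's standard `struct` module.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


TARGET = 2.0    # hypothetical toy problem: minimize (w - TARGET)^2
LR = 0.01       # hypothetical learning rate
SCALE = 1024.0  # hypothetical loss scale factor (a power of two)
STEPS = 2000


def grad_f16(w16, scale=1.0):
    """Gradient of scale * (w - TARGET)^2, rounded to float16 at each step."""
    err = to_f16(w16 - TARGET)
    return to_f16(scale * 2.0 * err)


def train_float16():
    """Pure float16: the weight, gradient, and update all live in float16."""
    w = 0.0
    for _ in range(STEPS):
        step = to_f16(LR * grad_f16(w))
        w = to_f16(w - step)
    return w


def train_mixed():
    """Mixed precision: float32 master weight, float16 gradient computation."""
    master = 0.0
    for _ in range(STEPS):
        w16 = to_f16(master)                # cast f32 weight to f16
        g16 = grad_f16(w16, SCALE)          # scaled gradient in f16
        g32 = to_f32(g16) / SCALE           # cast to f32, then unscale
        master = to_f32(master - LR * g32)  # update the f32 master weight
    return master


w_f16 = train_float16()
w_mixed = train_mixed()

# Loss scaling keeps a tiny gradient from underflowing in float16.
tiny = 1e-8
unscaled = to_f16(tiny)                   # underflows to zero
recovered = to_f16(tiny * SCALE) / SCALE  # survives, then unscaled

print("float16 final weight:", w_f16)
print("mixed precision final weight:", w_mixed)
print("tiny gradient without scaling:", unscaled)
print("tiny gradient with scaling:", recovered)
```

In this toy setting the pure float16 run stalls once the update falls below half the spacing between adjacent float16 values, whereas the float32 master weight keeps accumulating small updates and ends much closer to the target.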

Theorems & Definitions (6)

Theorem 1
proof
Corollary 1.1
proof
proof
proof
