Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

Juyoung Yun; Sol Choi; Francois Rameau; Byungkon Kang; Zhoulai Fu

TL;DR

This work investigates whether standalone IEEE 16-bit floating-point (FP16) training can match 32-bit (FP32) and mixed-precision (MP) training under hardware constraints. It develops a theoretical framework based on floating-point error and classification tolerance, proving a lemma that guarantees identical classifications when $\Gamma(M_{32},x) \geq 2\,\delta(M_{32}, M_{16}, x)$. Empirically, it validates the claim on MNIST and CIFAR-10 across CNNs and ViTs, finding that FP16 achieves about a $1.6\times$ speedup over FP32 and $1.3\times$ over MP with only ~0.2% average accuracy loss; occasional instability can be mitigated by repeated runs. The findings suggest FP16 training is a practical, accessible option for resource-limited ML workloads, reducing reliance on newer low-precision hardware such as FP8/FP4.

Abstract

With the increasing complexity of machine learning models, managing computational resources like memory and processing power has become a critical concern. Mixed precision techniques, which leverage different numerical precisions during model training and inference to optimize resource usage, have been widely adopted. However, access to hardware that supports lower precision formats (e.g., FP8 or FP4) remains limited, especially for practitioners with hardware constraints. For many with limited resources, the available options are restricted to using 32-bit, 16-bit, or a combination of the two. While it is commonly believed that 16-bit precision can achieve results comparable to full (32-bit) precision, this study is the first to systematically validate this assumption through both rigorous theoretical analysis and extensive empirical evaluation. Our theoretical formalization of floating-point errors and classification tolerance provides new insights into the conditions under which 16-bit precision can approximate 32-bit results. This study fills a critical gap, proving for the first time that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy while boosting computational speed. Given the widespread availability of 16-bit across GPUs, these findings are especially valuable for machine learning practitioners with limited hardware resources to make informed decisions.

Paper Structure (16 sections, 1 theorem, 6 equations, 4 figures, 3 tables)


Introduction
Related Work
Theoretical Analysis
Background: Floating-Point Error
Theory: Error vs. Tolerance
Notation.
Explanation of the lemma.
Observation: Standalone IEEE 16-bit Floating-Point DNN on MNIST
Remark.
Experimental Results
Precisions
Experimental Settings
Results
Discussion: Numerical Stability and Hyperparameter Tuning
Conclusion
...and 1 more sections

Key Result

Lemma

Let $\mathrm{class}(M_r, x)$ denote the classification result of the classifier $M_r$, for $r\in\{16,32\}$, on a sample $x\in \mathcal{X}$. Namely, $\mathrm{class}(M_r, x)\stackrel{\text{def}}{=}\mathrm{argmax}_{i}\{p_i \mid p_i\in M_r(x)\}$. We have: if $\Gamma(M_{32},x) \geq 2\,\delta(M_{32}, M_{16}, x)$, then $\mathrm{class}(M_{32}, x)=\mathrm{class}(M_{16}, x)$.
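The condition $\Gamma(M_{32},x) \geq 2\,\delta(M_{32}, M_{16}, x)$ from the TL;DR can be checked numerically on a pair of output vectors. The sketch below is illustrative only and rests on two assumptions that this summary does not spell out: that the classification tolerance $\Gamma$ is the gap between the largest and second-largest output probability of the 32-bit model, and that the floating-point error $\delta$ is the largest element-wise absolute difference between the 32-bit and 16-bit outputs. The function names and the sample vectors are hypothetical, not taken from the paper.

```python
import numpy as np

def tolerance(p):
    """Assumed definition of Gamma: gap between the top-2 probabilities."""
    top2 = np.sort(np.asarray(p, dtype=np.float64))[-2:]
    return float(top2[1] - top2[0])

def fp_error(p32, p16):
    """Assumed definition of delta: largest element-wise absolute difference."""
    a = np.asarray(p32, dtype=np.float64)
    b = np.asarray(p16, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

def lemma_condition_holds(p32, p16):
    """True when Gamma(M32, x) >= 2 * delta(M32, M16, x)."""
    return tolerance(p32) >= 2.0 * fp_error(p32, p16)

# Hypothetical confident prediction: the 16-bit output is just the
# 32-bit output rounded to float16, so delta is tiny compared to Gamma.
p32 = np.array([0.70, 0.20, 0.10], dtype=np.float32)
p16 = p32.astype(np.float16)

# Hypothetical near-tie: a small perturbation is enough to flip the argmax,
# and the condition correctly fails to guarantee agreement.
q32 = np.array([0.501, 0.499])
q16 = np.array([0.499, 0.501])

for name, a, b in [("confident", p32, p16), ("near-tie", q32, q16)]:
    print(name,
          "condition holds:", lemma_condition_holds(a, b),
          "same class:", int(np.argmax(a)) == int(np.argmax(b)))
```

Under these assumed definitions, the condition is sufficient but not necessary: when it fails, the two models may still agree, but agreement is no longer guaranteed.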

Figures (4)

Figure 1: Floating-point Errors vs. Classification Tolerance: Mean $\pm$ standard deviation
Figure 2: DNN Accuracies on MNIST Dataset: 32-bit vs. 16-bit floating-point
Figure 3: Comparative test accuracy over 100 epochs on CNNs and Vision Transformer (ViT) architectures utilizing IEEE 16-bit, 32-bit, and mixed precision.
Figure 4: Boxplot of Test Accuracy: This figure illustrates the performance of CNN models and the Vision Transformer across three floating-point precisions: IEEE 16-bit, 32-bit, and mixed precision. Results from 50 random seeds are included to ensure unbiased representation. The white lines indicate the medians, while the white dots represent outliers.

Theorems & Definitions (4)

Definition
Definition
Lemma
Proof
