Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

Juyoung Yun, Sol Choi, Francois Rameau, Byungkon Kang, Zhoulai Fu

TL;DR

This work investigates whether standalone IEEE 16-bit floating-point training can match 32-bit and mixed-precision training under hardware constraints. It develops a theoretical framework based on floating-point error and classification tolerance, proving a lemma that guarantees identical classifications when $\Gamma(M_{32},x) \geq 2 \, \delta(M_{32}, M_{16}, x)$. Empirically, it validates on MNIST and CIFAR-10 across CNNs and ViTs, finding that FP16 achieves about a $1.6\times$ speedup over FP32 and $1.3\times$ over MP with only ~0.2% average accuracy loss, and occasional instability can be mitigated by repeated runs. The findings suggest FP16 training is a practical, accessible option for resource-limited ML workloads, reducing reliance on newer low-precision hardware such as FP8/FP4.

Abstract

With the increasing complexity of machine learning models, managing computational resources like memory and processing power has become a critical concern. Mixed precision techniques, which leverage different numerical precisions during model training and inference to optimize resource usage, have been widely adopted. However, access to hardware that supports lower precision formats (e.g., FP8 or FP4) remains limited, especially for practitioners with hardware constraints. For many with limited resources, the available options are restricted to using 32-bit, 16-bit, or a combination of the two. While it is commonly believed that 16-bit precision can achieve results comparable to full (32-bit) precision, this study is the first to systematically validate this assumption through both rigorous theoretical analysis and extensive empirical evaluation. Our theoretical formalization of floating-point errors and classification tolerance provides new insights into the conditions under which 16-bit precision can approximate 32-bit results. This study fills a critical gap, proving for the first time that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy while boosting computational speed. Given the widespread availability of 16-bit across GPUs, these findings are especially valuable for machine learning practitioners with limited hardware resources to make informed decisions.

Paper Structure (16 sections, 1 theorem, 6 equations, 4 figures, 3 tables)

This paper contains 16 sections, 1 theorem, 6 equations, 4 figures, 3 tables.

Key Result

Lemma

Let $\mathrm{class}(M_r, x)$ denote the classification result of classifier $M_r$, for $r\in\{16,32\}$, on a sample $x\in \mathcal{X}$. Namely, $\mathrm{class}(M_r, x)\overset{\text{def}}{=}\mathrm{argmax}_{i}\{p_i \mid p_i\in M_r(x)\}$. We have: if $\Gamma(M_{32},x) \geq 2 \, \delta(M_{32}, M_{16}, x)$, then $\mathrm{class}(M_{32}, x)=\mathrm{class}(M_{16}, x)$.
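The lemma's condition, $\Gamma(M_{32},x) \geq 2\,\delta(M_{32}, M_{16}, x)$ as given in the TL;DR, can be checked numerically on a pair of output vectors. The sketch below is illustrative only. The definitions of $\Gamma$ and $\delta$ are not reproduced in this summary, so it assumes that the classification tolerance is the gap between the two largest 32-bit probabilities and that the floating-point error is the largest per-class difference between the 32-bit and 16-bit outputs. The function names and the sample vectors are hypothetical and are not taken from the paper.

```python
# Illustrative check of the lemma's sufficient condition.
# ASSUMPTION: Gamma = gap between the top two probabilities of the 32-bit model.
# ASSUMPTION: delta = largest absolute per-class difference between the outputs.
# All names and numbers below are hypothetical, not taken from the paper.

def classify(probs):
    """Index of the largest probability (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def tolerance(probs):
    """Assumed classification tolerance: top probability minus runner-up."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def fp_error(p32, p16):
    """Assumed floating-point error: max absolute per-class difference."""
    return max(abs(a - b) for a, b in zip(p32, p16))

def lemma_applies(p32, p16):
    """True when Gamma(M32, x) >= 2 * delta(M32, M16, x)."""
    return tolerance(p32) >= 2 * fp_error(p32, p16)

# Wide margin, small perturbation: the condition holds and the classes agree.
p32_safe = [0.70, 0.20, 0.10]
p16_safe = [0.68, 0.22, 0.10]

# Narrow margin: the condition fails, so the lemma gives no guarantee.
p32_risky = [0.51, 0.49]
p16_risky = [0.49, 0.51]

if __name__ == "__main__":
    for p32, p16 in [(p32_safe, p16_safe), (p32_risky, p16_risky)]:
        print(lemma_applies(p32, p16), classify(p32), classify(p16))
```

Under these assumed definitions, the condition is sufficient but not necessary: when it fails, the two models may still agree, but the lemma no longer guarantees it.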

Figures (4)

  • Figure 1: Floating-point Errors vs. Classification Tolerance: Mean $\pm$ standard deviation
  • Figure 2: DNN Accuracies on MNIST Dataset: 32-bit vs. 16-bit floating-point
  • Figure 3: Comparative test accuracy over 100 epochs on CNNs and Vision Transformer (ViT) architectures utilizing IEEE 16-bit, 32-bit, and mixed precision.
  • Figure 4: Boxplot of Test Accuracy: This figure illustrates the performance of CNN models and the Vision Transformer across three floating-point precisions: IEEE 16-bit, 32-bit, and mixed precision. Results from 50 random seeds are included to ensure unbiased representation. The white lines indicate the medians, while the white dots represent outliers.

Theorems & Definitions (4)

  • Definition
  • Definition
  • Lemma
  • Proof