Better Schedules for Low Precision Training of Deep Neural Networks

Cameron R. Wolfe, Anastasios Kyrillidis

TL;DR

This work analyzes cyclic precision training (CPT) as a dynamic low-precision scheme that reduces DNN training compute while preserving or improving performance. It formalizes CPT as a three-step design space (profile, number of cycles $n$, and repeated or triangular form) and defines the precision at iteration $t$ as $q_t = \text{round}(S(t))$, bounded by $q_{\min}$ and $q_{\max}$. Ten schedules, including the original CR schedule, are evaluated across CNNs, RNNs, Transformers, and GNNs. Empirically, many CPT variants outperform static low-precision baselines. A correlation also emerges between training cost and model performance, so the choice of schedule controls the tradeoff between the two, and aggressive quantization can harm large-scale tasks. The authors connect low-precision training to critical learning periods, suggesting that impairments during early training can yield permanent performance losses, and they derive best practices for choosing CPT schedules tailored to domain, model size, and compute budget. Overall, the paper broadens the CPT design space (notably applying it to GNNs) and provides actionable guidance for balancing training efficiency with model performance in real-world settings.
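
To make the definition $q_t = \text{round}(S(t))$ concrete, the following is a minimal sketch of how such a schedule could be computed. It is not the authors' implementation: the function names, the choice of a cosine growth profile, and the reading of "triangular" as mirroring every other cycle are assumptions made for illustration.

```python
import math


def cosine_profile(u):
    """Hypothetical growth profile: rises smoothly from 0 to 1 as u goes 0 -> 1."""
    return 0.5 * (1.0 - math.cos(math.pi * u))


def cpt_precision(t, total_iters, n_cycles, q_min, q_max, triangular=False):
    """Return an integer bit-width q_t for iteration t (0 <= t < total_iters).

    Assumed design: each cycle sweeps the profile from q_min up to q_max.
    With triangular=True, odd-numbered cycles are mirrored so that precision
    falls back down instead of jumping back to q_min.
    """
    cycle_len = total_iters / n_cycles
    cycle_idx = min(int(t // cycle_len), n_cycles - 1)
    u = (t - cycle_idx * cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    if triangular and cycle_idx % 2 == 1:
        u = 1.0 - u  # mirror alternate cycles
    s = q_min + (q_max - q_min) * cosine_profile(u)  # real-valued schedule S(t)
    # Round half up; Python's built-in round() rounds halves to even instead.
    q = math.floor(s + 0.5)
    return max(q_min, min(q_max, q))


if __name__ == "__main__":
    repeated = [cpt_precision(t, 100, 2, 3, 8) for t in range(100)]
    mirrored = [cpt_precision(t, 100, 2, 3, 8, triangular=True) for t in range(100)]
    print(repeated[::10])
    print(mirrored[::10])
```

In this sketch the repeated form restarts at $q_{\min}$ at the start of every cycle, while the triangular form starts the second cycle at $q_{\max}$ and descends. Swapping `cosine_profile` for another monotone function on $[0, 1]$ changes the profile without touching the rest of the schedule.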

Abstract

Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.

Paper Structure (14 sections, 2 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: A depiction of quantized forward/backward pass within a single DNN layer.
  • Figure 2: An illustration of profiles and schedules for CPT over $T$ total training iterations. Function profiles are depicted in the upper-left subplot, while the lower-left subplot illustrates the CR schedule with different numbers of cycles $n$. Remaining subplots depict all possible CPT schedules explored in this work (both with and without rounding to the nearest integer) for $n=2$ cycles.
  • Figure 3: Results of CPT experiments on CIFAR-10/100 and ImageNet. Colors represent profiles, while shapes distinguish repeated or triangular schedules. Experiments are run with $q_{\text{max}} \in \{6, 8\}$, distinguished by a dark outline around a shape. Future figures adopt the same scheme of colors and shapes.
  • Figure 4: Results of CPT experiments on Pascal VOC. The same coloring scheme is adopted from Figure 3.
  • Figure 5: Validation accuracy of GNN and GraphSAGE models trained on OGBN-Arxiv and OGBN-Products using $\texttt{Q-Agg}$ or $\texttt{FP-Agg}$ and $q_{\text{max}} = q_t = 8$.
  • ...and 3 more figures