Effects of Initialization Biases on Deep Neural Network Training Dynamics
Nicholas Pellegrino, David Szczecina, Paul W. Fieguth
TL;DR
The paper investigates how Initial Guessing Bias (IGB), an architecture-dependent skew in class predictions immediately after random initialization, shapes the early training dynamics of deep networks. It compares Cross-Entropy (CE) with two robustness-focused losses, Blurry (BL) and Piecewise-Zero (PZ), on a ResNet-50 trained on CIFAR-10, monitoring per-class probabilities and accuracies during the first epoch. The findings show that IGB drives an initial dominance of a subset of classes. CE and BL gradually rebalance toward more uniform class performance, while PZ confines learning to the initially favored class and stalls it for the others. The work highlights the critical role of initialization biases in determining early optimization behavior and cautions that robust losses may hinder learning if they suppress gradients from classes that are disfavored at initialization, informing loss design and training pipeline choices in practice.
Abstract
Untrained large neural networks, just after random initialization, tend to favour a small subset of classes, assigning high predicted probabilities to these few classes and approximately zero probability to all others. This bias, termed Initial Guessing Bias, affects the early training dynamics, when the model is fitting to the coarse structure of the data. The choice of loss function used to train the model has a large impact on how these early dynamics play out. Two recent loss functions, Blurry loss and Piecewise-Zero loss, were designed for robustness to label errors but can become unable to steer the direction of training when exposed to this initial bias. Results indicate that the choice of loss function has a dramatic effect on the early-phase training of networks and highlight the need for careful consideration of how Initial Guessing Bias may interact with various components of the training scheme.
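As a rough illustration of how Initial Guessing Bias could be measured, the sketch below counts how often an untrained network's top prediction lands on each class. It is a toy under stated assumptions, not the setup studied here: it uses a small bias-free ReLU multilayer perceptron with He-style Gaussian initialization and standard-normal inputs instead of a ResNet-50 on CIFAR-10, and the names `relu_mlp_logits` and `initial_guess_fractions`, along with all sizes and the seed, are hypothetical choices made for the example.

```python
import math
import random


def relu_mlp_logits(x, weights):
    """Forward pass through a bias-free ReLU MLP; the last layer is linear."""
    h = x
    for i, w in enumerate(weights):
        h = [sum(wi * hi for wi, hi in zip(row, h)) for row in w]
        if i < len(weights) - 1:  # no activation on the output logits
            h = [max(0.0, v) for v in h]
    return h


def initial_guess_fractions(num_inputs=200, in_dim=32, width=64, depth=4,
                            num_classes=10, seed=0):
    """Fraction of random inputs whose argmax prediction is each class,
    for a single randomly initialized (untrained) network."""
    rng = random.Random(seed)
    dims = [in_dim] + [width] * depth + [num_classes]
    # Assumed He-style initialization: zero-mean Gaussian, std = sqrt(2 / fan_in).
    weights = [
        [[rng.gauss(0.0, math.sqrt(2.0 / fan_in)) for _ in range(fan_in)]
         for _ in range(fan_out)]
        for fan_in, fan_out in zip(dims[:-1], dims[1:])
    ]
    counts = [0] * num_classes
    for _ in range(num_inputs):
        # Assumed stand-in for real data: i.i.d. standard-normal inputs.
        x = [rng.gauss(0.0, 1.0) for _ in range(in_dim)]
        logits = relu_mlp_logits(x, weights)
        counts[logits.index(max(logits))] += 1
    return [c / num_inputs for c in counts]


fractions = initial_guess_fractions()
for cls, frac in enumerate(fractions):
    print(f"class {cls}: {frac:.3f}")
```

An unbiased initialization would put each fraction near 1/10; fractions concentrated on a few classes would indicate the kind of skew described above. Repeating the measurement over several seeds and depths gives a sense of how strongly the effect depends on the architecture.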
