Effects of Initialization Biases on Deep Neural Network Training Dynamics

Nicholas Pellegrino, David Szczecina, Paul W. Fieguth

TL;DR

The paper investigates how Initial Guessing Bias (IGB), an architecture-dependent skew in class predictions immediately after random initialization, shapes the early training dynamics of deep networks. It compares Cross-Entropy (CE) with two robustness-focused losses, Blurry (BL) and Piecewise-Zero (PZ), on a ResNet-50 trained on CIFAR-10, monitoring per-class probabilities and accuracies during the first epoch. The findings show that IGB drives an initial dominance of a subset of classes, with CE and BL gradually rebalancing toward more uniform class performance, while PZ constrains learning to the initially favored class, stalling learning for others. The work highlights the critical role of initialization biases in determining early optimization behavior and cautions that robust losses may hinder learning if they suppress gradients from underrepresented classes, informing loss design and training pipeline choices in practice.

Abstract

Untrained large neural networks, just after random initialization, tend to favour a small subset of classes, assigning high predicted probabilities to these few classes and approximately zero probability to all others. This bias, termed Initial Guessing Bias, affects the early training dynamics, when the model is fitting to the coarse structure of the data. The choice of loss function against which to train the model has a large impact on how these early dynamics play out. Two recent loss functions, Blurry and Piecewise-zero loss, were designed for robustness to label errors but can become unable to steer the direction of training when exposed to this initial bias. Results indicate that the choice of loss function has a dramatic effect on the early-phase training of networks, and highlight the need for careful consideration of how Initial Guessing Bias may interact with various components of the training scheme.

Paper Structure

This paper contains 8 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Predicted probability for each class, averaged over the validation set, directly after model initialization. Observe that one class (class 0, randomly) is highly favoured ($\bar{p}_0\approx1$) relative to other classes ($\bar{p}_y\approx0$ for $y\ne0$) as a result of severe Initial Guessing Bias.
  • Figure 2: Training dynamics for three choices of loss function: Cross-Entropy (CE; top row), Blurry Loss (BL; second row), and Piecewise-zero Loss (PZ; bottom row). Averaged Softmax probabilities (left column) and per-class accuracy (right column) are shown throughout training. Training duration is plotted on a log-scale, starting at 0 (before training) and showing fractional parts of the first epoch (batches). At the outset, the IGB effect causes probabilities and accuracies to be highly distinct between favoured and unfavoured classes for all loss functions (see Figure 1); however, as training progresses, differences emerge. In all cases, predicted probabilities move towards each other during the first epoch and converge roughly to $p_y=0.1$ (approximating the class distribution of the dataset). After the first epoch, differences resulting from the choice of loss function reveal themselves. For Cross-Entropy, the predicted probabilities move as a group, gradually rising, with corresponding rises in per-class accuracies. Blurry loss behaves fairly similarly to Cross-Entropy, but with slightly slower dynamics and with accuracies lagging. For Piecewise-zero loss, the predicted probability for the originally favoured class remains above all others, and again rises, while all others fall. In this case, the resulting per-class accuracies remain largely unchanged.
  • Figure :
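
The two quantities plotted in Figures 1 and 2, class-averaged Softmax probabilities and per-class accuracy, can be sketched as follows. This is a minimal pure-Python illustration and not the authors' code: the function names (`softmax`, `mean_class_probs`, `per_class_accuracy`) and the toy logits are hypothetical, chosen only to imitate a model whose initial guesses all collapse onto class 0.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_class_probs(batch_logits):
    """Average the softmax probability of each class over all samples."""
    n_classes = len(batch_logits[0])
    sums = [0.0] * n_classes
    for logits in batch_logits:
        for k, p in enumerate(softmax(logits)):
            sums[k] += p
    return [s / len(batch_logits) for s in sums]


def per_class_accuracy(batch_logits, labels, n_classes):
    """For each true class, the fraction of its samples predicted correctly."""
    correct = [0] * n_classes
    counts = [0] * n_classes
    for logits, y in zip(batch_logits, labels):
        pred = max(range(n_classes), key=lambda k: logits[k])  # argmax
        counts[y] += 1
        correct[y] += int(pred == y)
    # Classes with no samples get NaN rather than a misleading 0.
    return [c / n if n else float("nan") for c, n in zip(correct, counts)]


# Hypothetical logits imitating a severely biased untrained model:
# every sample strongly favours class 0, whatever its true label.
n_classes = 3
batch_logits = [[10.0, 0.0, 0.0] for _ in range(6)]
labels = [0, 1, 2, 0, 1, 2]

probs = mean_class_probs(batch_logits)
acc = per_class_accuracy(batch_logits, labels, n_classes)
print("mean probabilities:", probs)
print("per-class accuracy:", acc)
```

In this toy case the mean probability of class 0 is close to 1 and the favoured class is the only one classified correctly, which mirrors the pattern the Figure 1 caption describes at initialization. Tracking these two lists at intervals during the first epoch would give curves of the kind shown in Figure 2.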