Initial Guessing Bias: How Untrained Networks Favor Some Classes

Emanuele Francazi, Aurelien Lucchi, Marco Baity-Jesi

TL;DR

The structure of a deep neural network (DNN) can provably condition the model to assign all predictions to the same class, even before training begins and in the absence of explicit biases. This phenomenon is called Initial Guessing Bias (IGB).

Abstract

Understanding and controlling biasing effects in neural networks is crucial for ensuring accurate and fair model performance. In the context of classification problems, we provide a theoretical analysis demonstrating that the structure of a deep neural network (DNN) can condition the model to assign all predictions to the same class, even before the beginning of training, and in the absence of explicit biases. We prove that, besides dataset properties, the presence of this phenomenon, which we call Initial Guessing Bias (IGB), is influenced by model choices, including dataset preprocessing methods and architectural decisions such as activation functions, max-pooling layers, and network depth. Our analysis of IGB provides information for architecture selection and model initialization. We also highlight theoretical consequences, such as the breakdown of node-permutation symmetry, the violation of self-averaging, and the non-trivial effects that depth has on the phenomenon.

Paper Structure (79 sections, 10 theorems, 167 equations, 18 figures)

This paper contains 79 sections, 10 theorems, 167 equations, 18 figures.

Key Result

Theorem 4.1

Consider a Gaussian-distributed dataset processed through an MLP with $L$ hidden layers and weights initialized according to the Kaiming normal scheme (with null bias weights). In the limit of infinite width, the distribution of an output node $O_c$, at initialization, converges to: where $|\mathcal{W}|$ indicates the cardinality of the set $\mathcal{W}$ and, for compactness, we denoted
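The setting of Theorem 4.1 can be probed numerically at finite width. The following is a minimal sketch, not the authors' code: it draws a fixed Gaussian dataset, repeatedly re-initializes a small MLP with Kaiming-normal weights and zero biases, and records the fraction $G_0$ of inputs assigned to class 0. The function names, the width, the input dimension, and the use of the same $\sqrt{2/\text{fan-in}}$ scale for both activations are all assumptions made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def untrained_mlp(x, hidden_widths, activation, rng, n_classes=2):
    """Outputs of a freshly initialized MLP (Kaiming-normal weights, zero biases)."""
    h = x
    for width in hidden_widths:
        fan_in = h.shape[1]
        # Kaiming-normal scale sqrt(2 / fan_in); assumed here for every activation
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        h = activation(h @ w)
    fan_in = h.shape[1]
    w_out = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, n_classes))
    return h @ w_out  # linear readout, one column per output node O_c


def guess_fractions(activation, n_inits=200, n_samples=500, dim=100,
                    hidden_widths=(256,), seed=0):
    """Fraction G_0 of inputs assigned to class 0, one value per initialization."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, dim))  # fixed Gaussian dataset (hypothetical sizes)
    g0 = np.empty(n_inits)
    for i in range(n_inits):
        out = untrained_mlp(x, hidden_widths, activation, rng)
        g0[i] = np.mean(out[:, 0] > out[:, 1])
    return g0


if __name__ == "__main__":
    for name, act in [("relu", relu), ("tanh", np.tanh)]:
        g0 = guess_fractions(act)
        print(f"{name}: mean G_0 = {g0.mean():.3f}, std G_0 = {g0.std():.3f}")
```

Under these assumptions, the spread of $G_0$ across initializations should be noticeably larger for ReLU than for $\tanh$, whose values stay close to 1/2. Passing a longer `hidden_widths` tuple gives a rough way to explore how depth changes the picture.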

Figures (18)

  • Figure 1: Initial Guessing Bias (IGB). Consider a task where we classify a binary dataset using an untrained network. Does it assign half of the examples to each class, or does it privilege one class? The answer depends on the model design. In the top-left, we classify a binary dataset with an untrained network without IGB. This model will generally assign half of the examples to each class (histogram on the top-center). In the bottom-left, we classify the same dataset using an untrained network with IGB. In this case, most of the guesses will usually go to one of the two classes (histogram on the bottom-center). As an example, we take the dog/cat classes (label $0$ / label $1$) from CIFAR10, and pass them through an untrained CNN with 2 layers, each followed by pooling. The non-IGB model uses $\tanh$ activations and average pooling, the IGB model uses ReLU and max pooling. We show in the center-right the distribution over different initializations, $f_{G_0}(g_0)$, of the fraction $G_0$ of times that each model guessed dog (equivalently, $G_1 = 1 - G_0$ indicates the fraction of images guessed as cat). While for the non-IGB model $G_0$ is most often 50%, with IGB it is most often either 0% or 100%.
  • Figure 2: Left: IGB and Performance Bound: Diagram of the accessible performance range conditioned on the behavior of $\max_c \{ G_c(\mathcal{W}_t) \}$ in a balanced binary dataset, in accordance with Eq. \ref{eq:acc_bound}. Right: Comparison of the trend of $\max_c \{ G_c(\mathcal{W}_t) \}$ with that of accuracy during the learning dynamics, for varying levels of guessing bias at initialization (IGB). In particular, $\gamma(\mathcal{A}, \psi(\chi)) \in \mathbb{R}^+$ (colormap on the right) provides a measure of the level of IGB: the higher the value of $\gamma(\mathcal{A}, \psi(\chi))$, the higher the level of IGB (see Sec. \ref{sec:BigPicture} for more details). The curves show a pattern consistent with the diagram on the left. They also demonstrate that the time for IGB absorption increases with the level of IGB itself. The simulations were conducted on an MLP-mixer using a binary dataset (dogs vs cats from CIFAR) as input; more details on the setting and additional experiments with more architectures/datasets are provided in App. \ref{app:dyn}.
  • Figure 3: Illustration of the key quantities used in the analysis: 1) the green and purple curves represent the distributions of the two output nodes for a fixed set of network weights, $\mathcal{W}$, and 2) the means of the distributions are denoted by $\mu_c$.
  • Figure 4: Comparison of two extreme scenarios: no IGB on the left, and strong IGB on the right. If the centers of the distributions, $\mu_c$, have small fluctuations compared to those of the distributions $f_{O_c}^{(\chi)}(o)$, the two distributions almost completely overlap, so each output node exceeds the other with similar probability (left). If, instead, the centers are typically much further apart than the fluctuation scale of the distributions $f_{O_c}^{(\chi)}(o)$, the values drawn from one distribution exceed those of the other with high probability (right). Each plot contains two inset plots. The inset plot in the upper left represents the distribution of the difference of the r.v.s shown in the main plot ($\Delta_O$). Note that, fixing the set $\mathcal{W}$ in a given experiment, and assuming a large enough dataset, Eq. \ref{eq:f0_def_LLN} holds (the probability mass of the r.h.s. is depicted with a red area bounded by the distribution and the integration extremes). The inset plot in the upper right shows instead $f_{\mu_c}(m)$ to give an idea of the fluctuations of $\mu_c$ for the two cases, measured also by the variance ratio reported above the inset plot.
  • Figure 5: The distribution $f_{G_0}(g_0)$ in a single-hidden-layer perceptron, for different choices of activation functions and with/without max pooling.
  • ...and 13 more figures
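
The comparison drawn in the captions of Figures 3 and 4, between the fluctuations of the centers $\mu_c$ across initializations and the spread of an output node over the data at fixed weights, can be estimated empirically. The sketch below is an illustration under stated assumptions, not the paper's definition of $\gamma$: it uses a single-hidden-layer network on Gaussian inputs, and the function names, sizes, and the specific ratio (variance of $\mu_c$ over initializations divided by the average per-initialization data variance) are hypothetical choices.

```python
import numpy as np


def output_stats(activation, n_inits=200, n_samples=500, dim=100, width=256, seed=0):
    """Per-initialization mean and variance (over the data) of one output node."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, dim))  # fixed Gaussian dataset
    mu = np.empty(n_inits)
    var_data = np.empty(n_inits)
    for i in range(n_inits):
        # Kaiming-normal weights, zero biases (same scale assumed for all activations)
        w1 = rng.normal(0.0, np.sqrt(2.0 / dim), size=(dim, width))
        w2 = rng.normal(0.0, np.sqrt(2.0 / width), size=width)
        o = activation(x @ w1) @ w2   # values of a single output node O_c
        mu[i] = o.mean()              # center mu_c for this set of weights
        var_data[i] = o.var()         # spread over the data at fixed weights
    return mu, var_data


def variance_ratio(activation, **kwargs):
    """Var over initializations of mu_c, divided by the mean data variance."""
    mu, var_data = output_stats(activation, **kwargs)
    return mu.var() / var_data.mean()


if __name__ == "__main__":
    print("relu:", variance_ratio(lambda z: np.maximum(z, 0.0)))
    print("tanh:", variance_ratio(np.tanh))
```

A ratio far below one corresponds to the overlapping-distributions scenario (left panel of Figure 4); a ratio of order one or larger means the centers move by roughly as much as the distributions are wide. In this toy setting the ReLU ratio should come out well above the $\tanh$ one.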

Theorems & Definitions (20)

  • Definition 3.1: IGB, informal
  • Theorem 4.1: Informal
  • Definition 4.2: IGB, formal
  • Definition 4.3: Strong IGB
  • Theorem 5.1: Informal
  • Theorem 5.2: Conditions for strong IGB, informal
  • Theorem 3.1: Lindeberg theorem
  • Theorem 3.2: Distribution convergence
  • Theorem 3.3: Sufficient condition for Thm. \ref{thm:DistrConv}
  • proof
  • ...and 10 more