Dissecting a Small Artificial Neural Network

Xiguang Yang; Krish Arora; Michael Bachmann

TL;DR

The paper investigates how a minimal sigmoid XOR network learns and how its high-dimensional loss landscape governs backpropagation convergence. It uses cross-sectional analysis of a $9$-parameter space, examines convergence dynamics under nonrandomized and randomized batches, and introduces a microcanonical entropy $S(L)=k_B \ln g(L)$ estimated via Wang-Landau/multicanonical methods to characterize phase-transition–like learning behavior. Key findings include a three-phase convergence with a long-time decay $L(\tau)\sim \tau^{-\gamma}$ where $\gamma$ grows with the hidden size $n_h$, the existence of zero-loss states for $n_h\ge 2$, and entropic barriers that fade as networks scale, suggesting barrier-free learning in larger systems. The study connects learning dynamics to annealing and phase-transition concepts, showing that the microcanonical-entropy framework can guide training strategies and scaling to broader neural architectures.
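As an illustration of the long-time decay $L(\tau)\sim \tau^{-\gamma}$ mentioned above, the exponent $\gamma$ can be read off as the negative slope of a straight-line fit in log-log coordinates. The following is a minimal sketch, not the authors' procedure: the helper name `fit_power_law_exponent`, the fitting window, and the synthetic loss curve (prefactor 3, exponent 1.5) are assumptions made purely for demonstration and are not results from the paper.

```python
import numpy as np

def fit_power_law_exponent(tau, loss):
    """Fit loss ~ A * tau**(-gamma) by least squares in log-log space.

    Hypothetical helper for illustration; returns (gamma, A).
    """
    tau = np.asarray(tau, dtype=float)
    loss = np.asarray(loss, dtype=float)
    # log(loss) = log(A) - gamma * log(tau), i.e. a straight line
    slope, intercept = np.polyfit(np.log(tau), np.log(loss), 1)
    return -slope, np.exp(intercept)

# Synthetic late-time loss curve (assumed values, not data from the paper)
tau = np.arange(1_000.0, 100_001.0, 1_000.0)
synthetic_loss = 3.0 * tau ** -1.5

gamma, prefactor = fit_power_law_exponent(tau, synthetic_loss)
print(f"gamma = {gamma:.3f}, prefactor = {prefactor:.3f}")
```

On a real loss curve, the fit should be restricted to the late epochs where the curve is already straight in the log-log plot, since the earlier phases of convergence do not follow the power law.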

Abstract

We investigate the loss landscape and the convergence dynamics of backpropagation for the simplest possible artificial neural network representing the logical exclusive-OR (XOR) gate. Cross-sections of the loss landscape in the nine-dimensional parameter space exhibit distinct features, which help explain why backpropagation efficiently converges toward zero loss even though the values of weights and biases keep drifting. Differences in the shapes of cross-sections obtained with nonrandomized and randomized batches are discussed. Drawing on statistical physics, we introduce the microcanonical entropy as a unique quantity that allows us to characterize the phase behavior of the network. Learning in neural networks can thus be thought of as an annealing process that experiences the analogue of phase transitions known from thermodynamic systems. The microcanonical entropy also reveals how the loss landscape simplifies as more hidden neurons are added to the network, eliminating entropic barriers caused by finite-size effects.
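To make the nine-dimensional parameter space concrete, the sketch below builds a 2-2-1 sigmoid network for XOR and trains it with full-batch gradient descent using backpropagation. It is a sketch under stated assumptions rather than the authors' implementation: the loss definition (half the mean squared error), the random initialization, the seed, and the number of epochs are all assumed here; only the sigmoid activations, the hidden size $n_\mathrm{h}=2$, and the learning rate $\eta=0.1$ quoted in the figure captions come from the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR truth table: four input pairs and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

n_h = 2     # hidden neurons, as in the minimal network
eta = 0.1   # learning rate quoted in the figure captions
rng = np.random.default_rng(0)  # assumed seed and initialization

W1 = rng.normal(size=(2, n_h))   # 4 hidden-layer weights
b1 = np.zeros(n_h)               # 2 hidden-layer biases
W2 = rng.normal(size=(n_h, 1))   # 2 output-layer weights
b2 = np.zeros(1)                 # 1 output-layer bias
n_params = W1.size + b1.size + W2.size + b2.size

losses = []
for epoch in range(5000):  # assumed number of epochs
    # Forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # Assumed loss: half the mean squared error over the four cases
    losses.append(0.5 * np.mean((a2 - y) ** 2))
    # Backward pass (chain rule through both sigmoid layers)
    delta2 = (a2 - y) * a2 * (1 - a2) / len(X)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    # Gradient-descent updates on all weights and biases
    W2 -= eta * a1.T @ delta2
    b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0)

print(n_params, losses[0], losses[-1])
```

The parameter count is $4+2+2+1=9$, matching the dimension of the loss landscape discussed above. This sketch does not guarantee that zero loss is reached within the chosen number of epochs; it only shows the mechanics of one training run.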

Paper Structure (10 sections, 9 equations, 8 figures, 2 tables)

Introduction
XOR Neural Network Model and Optimization
Optimization Characteristics of the XOR Network and Loss Landscape
Dependence of Network Performance on the Learning Rate
Convergence Dynamics
Analysis of the Loss Landscape
Nonrandomized Batch
Randomized Batches
Density of Loss and Microcanonical Entropy
Summary

Figures (8)

Figure 1: Parametrization of the simplest artificial neural network used in this study with only $n_\mathrm{h}=2$ neurons in the hidden layer.
Figure 2: Convergence of the loss function for the minimal network with $n_\mathrm{h}=2$ hidden neurons as a function of epochs $\tau$ for various learning rates. For comparison, the loss curve for a larger network with $n_\mathrm{h}=18$ neurons in the hidden layer is also included. Reference lines with values of the exponent $\gamma$ attached support the power-law behavior in the long term.
Figure 3: Drifts of weights and biases for the (a) hidden and (b) output layer as the optimization process progresses through the epochs $\tau$ at learning rate $\eta=0.1$. Note that the pairs of weights $w_{11}^{(1)},w_{12}^{(1)}$ and $w_{21}^{(1)},w_{22}^{(1)}$, respectively, are indistinguishable for this solution (#1 in Table \ref{tab:sol}) of the problem.
Figure 4: Convergence of activations $a_i^{(l)}$ of hidden and output neurons to solution #1 in Table \ref{tab:sol} for all four cases listed in Table \ref{tab:xor} as functions of epoch $\tau$ for the XOR sigmoid network. The learning rate was $\eta=0.1$.
Figure 5: Epoch of convergence $\tau_\mathrm{conv}$ plotted as a function of the deviation of initial (a) weights $w_{ij}^{(l)}$ and (b) biases $b_i^{(l)}$ from the respective optimal values for solution #1 given in Table \ref{tab:sol}. In each case, all other weights and biases are initialized at their optimal values.
...and 3 more figures
