Analog Physical Systems Can Exhibit Double Descent

Sam Dillavou, Jason W Rocks, Jacob F Wycoff, Andrea J Liu, Douglas J Durian

TL;DR

This work demonstrates double descent in a decentralized analog network of self-adjusting resistive elements, and shows that analog physical systems, if appropriately trained, can exhibit behaviors underlying the success of digital AI.

Abstract

An important component of the success of large AI models is double descent, in which networks avoid overfitting as they grow relative to the amount of training data, instead improving their performance on unseen data. Here we demonstrate double descent in a decentralized analog network of self-adjusting resistive elements. This system trains itself and performs tasks without a digital processor, offering potential gains in energy efficiency and speed -- but must endure component non-idealities. We find that standard training fails to yield double descent, but a modified protocol that accommodates this inherent imperfection succeeds. Our findings show that analog physical systems, if appropriately trained, can exhibit behaviors underlying the success of digital AI. Further, they suggest that biological systems might similarly benefit from over-parameterization.

Paper Structure

This paper contains 4 sections, 15 equations, 7 figures.

Figures (7)

  • Figure 1: Network and Task Details. (a) Experiment: Image of the Contrastive Local Learning Network (CLLN) with input (gray and white) and output (blue and orange) node locations marked. The two blue nodes labeled $+$ are directly connected, as are the two orange nodes labeled $-$. (b) Network Structure: A schematic of the self-adjusting edges (gray lines) connecting the variable inputs ($V_i$, white), constant inputs ($V_-\approx 0.02$ V and $V_0\approx 0.23$ V, dark gray), and outputs ($O_\pm$, orange and blue). The smaller gray squares denote hidden nodes (unlabeled in (a)). Nodes in (a) are labeled by the subscripts used in (b). (c) Hinge Loss Function: The losses for each class are shown. Both classes have nonzero loss within the shaded region. (d) Example Task Result: Results for a task with 6 training data points (circles) trained with the hinge loss in (c). The true classification division is drawn as a black dashed line. Note that one label has been flipped, representing noise. Background color denotes the output after training as a function of the inputs, $O(V_1,V_2)$. Inputs outside the colored circular region were not used in training or testing.
  • Figure 2: Experimental Results. (a) Classification Error at the end of training vs. parameters divided by training datapoints ($\gamma = P/M$) for training (gray squares) and test (purple circles) sets. Both are monotonic functions of $\gamma$. Error bars are standard error, but are all smaller than the markers. (b) Hinge Error for the same experiments. Error bars are again standard error. Here the test error is a strongly non-monotonic function of $\gamma$, with a peak at $\gamma = 32/5$.
  • Figure 3: Output Distributions. (a) Example Results of individual training runs for $\gamma = P/M$ values of 32/32, 32/5, and 32/2. Orange triangles and blue circles are training data from the $+$ and $-$ classes, respectively. The dashed line is the true class division, and the background is the network output after training. (b) Output Distributions: Loss function (top) and final output distribution (bottom) as a function of output $O$. Data are shown only for the $-$ class (blue, label $=-\rho$); the equivalent plot for the $+$ class is shown in Appendix \ref{appendix:additionalresults}, Fig. \ref{fig:S3}. Shaded regions are zero-loss for this class. Training error (left) decreases with $\gamma$, as expected when parameters grow relative to datapoints. Test error peaks at mid-range $\gamma$, here shown by the long right-side tail for $\gamma=5.3$.
  • Figure 4: Comparison with Digital Networks. (a) Training Hinge Loss vs. 1 / datapoints ($1/M$). Experimental results (purple) are compared with the average training loss for an ensemble of neural networks with 8 (solid lines) and 7 (dashed lines) trainable parameters ($P$). Networks with ReLU (yellow) and tanh (blue) activations were used. These networks were trained on a two-feature (2D input dimension) binary classification task generated in the same manner as the experimental data. (b) Test Hinge Loss vs. 1 / datapoints for the same task and networks. $M^*$ denotes the test loss peak. Inset: $M^*$ vs. parameters $P$ for ReLU (yellow squares) and tanh (blue triangles) networks. Networks with 7 and 8 parameters (curves in (a) and (b)) are highlighted. The horizontal purple line denotes the experimental $M^*$ value, and the black dotted line denotes the theoretical (large-scale) prediction $M^*=2P$.
  • Figure 5: Typical Experimental Results Without Overclamping. (a) Classification Error at the end of training vs. parameters divided by training datapoints ($\gamma = P/M$) for training (gray) and test (purple) sets. Error bars are standard error. (b) Hinge Error for the same experiments vs. training datapoints for training (gray) and test (purple) sets. Error bars are standard error. Note that here $\rho \approx 11$ mV, and similarly shaped curves result from larger labels. Labels as small as those used in the main text ($\rho \approx 2.3$ mV) cannot be meaningfully distinguished through training without overclamping.
  • ...and 2 more figures
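
The captions above refer to three quantities: the hinge loss with labels $\pm\rho$, the classification error, and the ratio $\gamma = P/M$. The Python sketch below shows one plausible way to compute them. The exact loss used in the paper is not given in this excerpt, so the form $\max(0, \rho - yO)$ with $y=\pm 1$ is an assumption, chosen to match the captions (zero loss for the $-$ class once $O \le -\rho$, nonzero loss for both classes when $|O| < \rho$). The function names are hypothetical, and $P = 32$ is only inferred from the ratios 32/32, 32/5, and 32/2 quoted in Figure 3.

```python
import numpy as np

# Label magnitude in volts; rho ~ 2.3 mV is quoted in the Figure 5 caption.
RHO = 2.3e-3


def hinge_loss(output, label_sign, rho=RHO):
    """Assumed hinge form max(0, rho - y * O), with y = +1 or -1.

    The loss is zero once the output lies at or beyond the label on the
    correct side, and positive for both classes when |O| < rho.
    """
    output = np.asarray(output, dtype=float)
    return np.maximum(0.0, rho - label_sign * output)


def classification_error(outputs, label_signs):
    """Fraction of points whose output sign disagrees with the label sign."""
    outputs = np.asarray(outputs, dtype=float)
    label_signs = np.asarray(label_signs)
    return float(np.mean(np.sign(outputs) != label_signs))


def gamma(num_parameters, num_training_points):
    """Ratio of trainable parameters P to training datapoints M."""
    return num_parameters / num_training_points


if __name__ == "__main__":
    # A '-' class point sitting exactly on the decision boundary (O = 0)
    # incurs hinge loss even though it is not yet misclassified by much.
    print(hinge_loss(0.0, -1))
    # The three gamma values shown in Figure 3, assuming P = 32.
    print([gamma(32, m) for m in (32, 5, 2)])
```

Under this assumed form, a point can be classified correctly (right sign) and still carry hinge loss if its output falls inside the margin. That distinction is consistent with Figure 2, where classification error is monotonic in $\gamma$ while hinge error is not.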