Double descent: When do neural quantum states generalize?

M. Schuyler Moss, Alev Orfi, Christopher Roth, Anirvan M. Sengupta, Antoine Georges, Dries Sels, Anna Dawid, Agnes Valenti

Abstract

Neural quantum states (NQS) provide flexible and compact wavefunction parameterizations for numerical studies of quantum many-body physics. In particular, NQS aim to circumvent the exponential scaling of the Hilbert space by compressing quantum many-body wavefunctions with a tractable number of parameters. While NQS are inspired by deep learning, it remains unclear to what extent they share characteristics with neural networks used for standard machine learning tasks. We demonstrate that, in a simplified supervised setting, NQS exhibit the double descent phenomenon, a key feature of modern deep learning, where generalization worsens as network size increases before improving again in an overparameterized regime. Notably, we find the second descent to occur only for network sizes much larger than the Hilbert space dimension, i.e., network sizes that are out of reach for problems of practical interest. Within our setting, this observation places typical NQS in the underparameterized regime. We also observe that the optimal network size in the underparameterized regime depends on the number of unique training samples. While the double descent phenomenon does indeed translate to the NQS setting, the potential practical consequences of our findings point more towards the need for symmetry-aware, physics-informed architecture design than towards directly adopting machine learning heuristics.

Paper Structure

This paper contains 23 sections, 16 equations, 19 figures.

Figures (19)

  • Figure 1: (a) A schematic showing the general features of double descent for deep neural networks (Belkin et al., 2019). (b) Training and test loss as a function of the number of network parameters when our NQS are trained on $\mathcal{D}_{\rm Train}^{\mathrm{top}\,75\%}$. Markers represent the loss for an individual trained network, and the solid lines represent the averages over ten random initializations. (c) The infidelity between the trained NQS and the true ground state $\vert\Omega\rangle$. The black vertical line represents our estimate of the interpolation threshold. The gray dashed line (blue dashed line) indicates where the number of network parameters equals the size of the Hilbert space, $N_\mathrm{params}=2^N$ (the number of training configurations, $N_\mathrm{params}=75\% \times 2^N$).
  • Figure 2: The variance of the training and test loss and the infidelities presented in Fig. 1 as a function of the number of network parameters. The variance is taken across ten random initializations for each network size. The vertical lines follow the same convention as in Fig. 1.
  • Figure 3: (a) Test and training loss for NQS trained on uniformly sampled training data $\mathcal{D}_{\rm Train}^{\mathrm{unif.}\,75\%}$. Markers represent the loss for an individual trained network, and the solid lines represent the averages over ten random initializations and datasets. (b) Infidelity between the trained NQS and the true ground state $\vert\Omega\rangle$. The vertical lines follow the same convention as in Fig. 1. Panels (c) and (d) show the largest squared wavefunction amplitudes of the exact ground state, with dots and stars indicating the training and test configurations, respectively. These two dataset splittings exemplify the feature in the training data that leads to the two types of behavior in the test loss and infidelity. In (c), only one of the two highest-probability configurations is in $\mathcal{D}_{\rm Train}^{\mathrm{unif.}\,75\%}$; in (d), both configurations are in $\mathcal{D}_{\rm Train}^{\mathrm{unif.}\,75\%}$.
  • Figure 4: (a) The normalization constant $\mathcal{N}$ and (b) the parity error $\epsilon_{\rm parity}$ for NQS trained on different training datasets. Purple markers show these metrics for NQS trained on $\mathcal{D}_{\rm Train}^{\mathrm{top}\,75\%}$. Orange markers show these metrics for NQS trained on the $\mathcal{D}_{\rm Train}^{\mathrm{unif.}\,75\%}$ datasets that contain only a single high-probability configuration, e.g., the one shown in Fig. 3(c). Note the difference in y-axis scale above and below $\mathcal{N}=1$ in (a), marked by the shaded region. Markers, solid lines, and the vertical lines follow the same convention as in Fig. 1.
  • Figure 5: (a),(c) Training and test loss as a function of the number of network parameters when our NQS are trained on $\mathcal{D}_{\rm Train}^{\mathrm{top}\,50\%}$ and $\mathcal{D}_{\rm Train}^{\mathrm{top}\,25\%}$, respectively. (b),(d) The infidelity between the corresponding trained wavefunctions and the true ground state $\vert\Omega\rangle$. Markers represent individual trained networks, and the solid lines represent the averages over ten random initializations. The gray dashed line (blue dashed line) indicates where the number of network parameters equals the size of the Hilbert space, $N_\mathrm{params}=2^N$ (the number of training configurations, $N_\mathrm{params}=50\% \times 2^N$ in the first column and $N_\mathrm{params}=25\% \times 2^N$ in the second column).
  • ...and 14 more figures
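
The captions above report the infidelity between a trained NQS and the ground state $\vert\Omega\rangle$, and refer to training sets such as $\mathcal{D}_{\rm Train}^{\mathrm{top}\,75\%}$. As a rough illustration only, the sketch below computes the standard infidelity $1 - |\langle\Omega|\psi\rangle|^2 / (\langle\psi|\psi\rangle\langle\Omega|\Omega\rangle)$ and builds a "top fraction" split. It assumes that "top" means the configurations with the largest squared amplitudes, which this excerpt does not state explicitly. The function names `infidelity` and `split_top_fraction` and the toy amplitudes are hypothetical and are not taken from the paper.

```python
import math


def infidelity(psi, omega):
    """Return 1 - |<omega|psi>|^2 / (<psi|psi><omega|omega>) for amplitude lists."""
    overlap = sum(o.conjugate() * p for o, p in zip(omega, psi))
    norm_psi = sum(abs(p) ** 2 for p in psi)
    norm_omega = sum(abs(o) ** 2 for o in omega)
    return 1.0 - abs(overlap) ** 2 / (norm_psi * norm_omega)


def split_top_fraction(omega, fraction):
    """Split configuration indices by squared amplitude (assumed meaning of "top").

    Returns (train, test): the highest-probability `fraction` of configurations
    and the remaining ones.
    """
    order = sorted(range(len(omega)), key=lambda i: abs(omega[i]) ** 2, reverse=True)
    n_train = int(round(fraction * len(omega)))
    return order[:n_train], order[n_train:]


# Toy "ground state" on N = 3 spins (2^N = 8 configurations); amplitudes are made up.
N = 3
raw = [0.6, 0.1, 0.05, 0.3, 0.3, 0.05, 0.1, 0.6]
norm = math.sqrt(sum(a * a for a in raw))
omega = [a / norm for a in raw]

train_idx, test_idx = split_top_fraction(omega, 0.75)
train_set = set(train_idx)

# Trial state that matches omega on the training set and vanishes on the test set.
psi = [omega[i] if i in train_set else 0.0 for i in range(2 ** N)]

print("train:", sorted(train_idx), "test:", sorted(test_idx))
print("infidelity:", infidelity(psi, omega))
```

In this toy case the infidelity equals the probability mass held by the test configurations, which shows why a state can fit its training set exactly and still differ from $\vert\Omega\rangle$.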