Quantitative convergence of trained quantum neural networks to a Gaussian process

Anderson Melchor Hernandez, Filippo Girardi, Davide Pastorello, Giacomo De Palma

TL;DR

The paper bounds the Wasserstein distance of order $1$ between the probability distribution of the function generated by a finite-width quantum neural network, untrained or trained, and the corresponding Gaussian process. A quantitative bound on the maximum variation of a parameter during training implies that, for sufficiently large widths, training occurs in the lazy regime.

Abstract

We study quantum neural networks where the generated function is the expectation value of the sum of single-qubit observables across all qubits. In [Girardi et al., arXiv:2402.08726], it is proven that the probability distributions of such generated functions converge in distribution to a Gaussian process in the limit of infinite width for both untrained networks with randomly initialized parameters and trained networks. In this paper, we provide a quantitative proof of this convergence in terms of the Wasserstein distance of order $1$. First, we establish an upper bound on the distance between the probability distribution of the function generated by any untrained network with finite width and the Gaussian process with the same covariance. This proof utilizes Stein's method to estimate the Wasserstein distance of order $1$. Next, we analyze the training dynamics of the network via gradient flow, proving an upper bound on the distance between the probability distribution of the function generated by the trained network and the corresponding Gaussian process. This proof is based on a quantitative upper bound on the maximum variation of a parameter during training. This bound implies that for sufficiently large widths, training occurs in the lazy regime, i.e., each parameter changes only by a small amount. While the convergence result of [Girardi et al., arXiv:2402.08726] holds at a fixed training time, our upper bounds are uniform in time and hold even as $t \to \infty$.
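To make the notion of a quantitative bound in the Wasserstein distance of order $1$ more concrete, the following toy sketch may help. It is not the paper's construction: it replaces the network output by a normalized sum of $m$ independent, bounded, unit-variance random variables (an assumed stand-in for the sum of single-qubit observables, ignoring the correlations present in an actual circuit) and estimates its distance to the standard Gaussian. In one dimension the distance can be written as $W_1(\mu,\nu)=\int_0^1 |F_\mu^{-1}(p)-F_\nu^{-1}(p)|\,dp$, which the code approximates from sorted samples. All function names and parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist


def sample_normalized_sum(m, rng):
    """Draw (X_1 + ... + X_m) / sqrt(m) with X_j i.i.d. Uniform(-sqrt(3), sqrt(3)).

    Each X_j has zero mean and unit variance, so the normalized sum has
    unit variance for every m. Independence is a toy assumption.
    """
    half_width = math.sqrt(3.0)
    total = sum(rng.uniform(-half_width, half_width) for _ in range(m))
    return total / math.sqrt(m)


def w1_to_standard_gaussian(samples):
    """Approximate W_1 between the empirical law of `samples` and N(0, 1).

    Uses the one-dimensional quantile formula: the i-th order statistic is
    compared with the Gaussian quantile at level (i + 0.5) / n.
    """
    n = len(samples)
    ordered = sorted(samples)
    gauss = NormalDist()  # standard Gaussian, mean 0 and standard deviation 1
    return sum(
        abs(x - gauss.inv_cdf((i + 0.5) / n)) for i, x in enumerate(ordered)
    ) / n


def estimate_w1_by_width(widths=(1, 4, 16, 64), n_samples=4000, seed=0):
    """Return {m: estimated W_1 distance to N(0, 1)} for each toy width m."""
    rng = random.Random(seed)
    return {
        m: w1_to_standard_gaussian(
            [sample_normalized_sum(m, rng) for _ in range(n_samples)]
        )
        for m in widths
    }


if __name__ == "__main__":
    for m, w1 in estimate_w1_by_width().items():
        print(f"m = {m:3d}  estimated W_1 = {w1:.4f}")
```

The estimate for large $m$ is limited by sampling noise, which decays roughly like the inverse square root of `n_samples`, so the printed values indicate the trend only and say nothing about the rate proven in the paper.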

Paper Structure

This paper contains 23 sections, 19 theorems, 237 equations, 2 tables.

Key Result

Theorem 1.1

We denote with $\overline{X}$ the vector made by the elements of $\mathcal{X}$, and with $f(\Theta,\overline{X})$ the vector made by the associated outputs. Let $\overline{{\mathcal{K}}}_{0}$ be the covariance matrix of $f(\Theta,\overline{X})$ when the parameters $\Theta$ are randomly initialized. where $\mathcal{N}(0,\overline{{\mathcal{K}}}_{0})$ denotes the centered Gaussian distribution with

Theorems & Definitions (43)

  • Theorem 1.1: Convergence at initialization, informal statement
  • Theorem 1.2: Convergence of the trained network, informal statement
  • Theorem 1.3: Lazy training, informal statement
  • Definition 2.1
  • Definition 2.2
  • Definition 2.3: Light cones
  • Definition 2.4: Extended light cones
  • Remark 2.1
  • Lemma 2.1: from Girardi et al., arXiv:2402.08726
  • Lemma 2.2
  • ...and 33 more