Fisher Information, Training and Bias in Fourier Regression Models

Lorenzo Pastori, Veronika Eyring, Mierk Schwabe

TL;DR

The paper examines how well metrics based on the Fisher information matrix (FIM), in particular the effective dimension (ED), predict the training behavior of Fourier models that are equivalent to a broad class of quantum neural networks (QNNs). It derives an analytical expression of the FIM for Fourier models and links the ED to the model's correlation spectrum, which makes it possible to construct models with tunable ED and bias. The study reports a bias–ED tradeoff: a high ED tends to help models that are unbiased towards the target function, while a low ED tends to help biased ones. This behavior is also observed in tensorized Fourier models that scale to larger problem sizes. Overall, the work clarifies how geometric properties and model-task alignment govern trainability in quantum-inspired regression, with tensor networks offering a scalable analysis framework.

Abstract

Motivated by the growing interest in quantum machine learning, in particular quantum neural networks (QNNs), we study how effective recently introduced evaluation metrics based on the Fisher information matrix (FIM) are for predicting their training and prediction performance. We exploit the equivalence between a broad class of QNNs and Fourier models, and study the interplay between the effective dimension and the bias of a model towards a given task, investigating how these affect the model's training and performance. We show that for a model that is completely agnostic, or unbiased, towards the function to be learned, a higher effective dimension likely results in better trainability and performance. On the other hand, for models that are biased towards the function to be learned, a lower effective dimension is likely beneficial during training. To obtain these results, we derive an analytical expression of the FIM for Fourier models and identify the features controlling a model's effective dimension. This allows us to construct models with tunable effective dimension and bias, and to compare their training. We furthermore introduce a tensor network representation of the considered Fourier models, which could be a tool of independent interest for the analysis of QNN models. Overall, these findings provide an explicit example of the interplay between geometrical properties, model-task alignment and training, which are relevant for the broader machine learning community.

Paper Structure

This paper contains 36 sections, 114 equations, 21 figures, 1 table.

Figures (21)

  • Figure 1: Illustration of main results. (a) Schematic behavior of the difference in the training loss (here the mean squared error, MSE) $\Delta\mathrm{MSE}$ between models with high and low effective dimension (ED), vs. the corresponding ED difference $\Delta\mathrm{ED}$ (between models with high and low ED, normalized by the number of trainable parameters $M$), for models with different bias towards the function to be learned (represented by the color scale). Each point represents the average behavior over several model realizations and training experiments. In the biased case, models with low ED train better than models with high ED (positive $\Delta\mathrm{MSE}$). The converse is true (negative $\Delta\mathrm{MSE}$) in the unbiased case. (b) Visualization of model spaces in the high-ED (blue surface) and low-ED (brown line) cases, for the biased and unbiased cases. Points in these spaces represent functions obtained for specific choices of trainable parameters. In the biased case, the data-generating function (red point) belongs to the model space (to good approximation): a model with low ED (brown point) trained with gradient descent is more likely to converge to the data-generating function since there are effectively fewer dimensions to explore (only one direction leads to minimizing the loss, as represented by the black arrow). A model with high ED (blue point) is instead more likely to get stuck in local minima (there are multiple directions, represented by the black arrows, leading to similar loss minimization). In the unbiased case the data-generating function is outside the model space: a model with high ED (blue point) is likely to yield better results, as more directions are available for reaching a better approximation to the data-generating function. (c) Illustration of a QNN and the expansion of its output $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ in the basis functions $e_{\mu}(\boldsymbol{x})$ and $\iota_{\nu}(\boldsymbol{\theta})$ (with $\boldsymbol{x}$ denoting the inputs and $\boldsymbol{\theta}$ the trainable parameters). The coefficient matrix $\Gamma$ (structure constants) can be decomposed into orthogonal matrices $U$ and $V$ and a diagonal matrix $S$ of singular values $s_{\rho}$. The ED of the model is controlled by the decay properties of the singular values: a faster decay results in a lower ED.
  • Figure 2: (a) Scaling of normalized ED with the purity $\mathrm{tr}(S^4)$ of the correlation spectrum. (b) Scaling of normalized ED with the ratio $D/M$. Each point corresponds to a random model realization, i.e., a random $\Gamma$ uniformly drawn from $[-1,+1]^{D\times K}$. For every value of $\mathrm{tr}(S^4)$ and $D/M$, $50$ model realizations are drawn (the points lie on top of each other). The normalized ED is computed using Eq. \ref{eq:norm_eff_dim}, with $150$ parameter samples for estimating the normalized FIM. Here, $\tilde{d}=2$ refers to $\tilde{\mathcal{B}}_m=\{\sqrt{2}\cos\theta_m,\sqrt{2}\sin\theta_m\}$, while $\tilde{d}=3$ refers to $\tilde{\mathcal{B}}_m=\{1,\sqrt{2}\cos\theta_m,\sqrt{2}\sin\theta_m\}$ in Eq. \ref{eq:param_basis_funs_Fourier}, for all $m=1,...,M$.
  • Figure 3: (a) Schematic illustration of the construction of biased and unbiased models. The data-generating function $y$ is specified by matrices $U^{\mathrm{(d)}}$ and $V^{\mathrm{(d)}}$, and the model $f$ by $U$ and $V$, with both $y$ and $f$ having the same correlation spectrum $S$. In the unbiased case, $y$ (represented by the red dot) lies outside the space of functions accessible to $f_{\boldsymbol{\theta}}$ (represented by the gray surface), whereas in the biased case $y$ belongs to that space. (b) Construction of models with tunable ED. Full models $f^{\mathrm{(f)}}_{\boldsymbol{\theta}}(\boldsymbol{x})$ with no imposed decay in the correlation spectrum $s_{\rho}$, as illustrated by the blue line, have high ED and can therefore access a larger function space, represented by the blue surface. Cutoff models $f^{\mathrm{(c)}}_{\boldsymbol{\theta}}(\boldsymbol{x})$ with decaying correlation spectrum $s_{\rho}$, as illustrated by the brown line, have low ED and access a more restricted function space, represented by the brown surface.
  • Figure 4: (a) $\Delta_{\mathrm{f-c}}\mathrm{MSE}_{\mathrm{min}}$ for different values of $\hat{d}_{\mathrm{eff}}^{\mathrm{full}}-\hat{d}_{\mathrm{eff}}^{\mathrm{cut}}$, for biased (blue points) and unbiased (yellow points) models. Each point corresponds to $\Delta_{\mathrm{f-c}}\mathrm{MSE}_{\mathrm{min}}$ averaged over $30$ training instances starting from randomly chosen parameters, for a single random model realization, i.e., a random $\Gamma$ uniformly drawn from $[-1,+1]^{D\times K}$. The red line serves as a guide for the eye for zero $\mathrm{MSE}$ difference. (b) Training curves for a random biased model realization, with full model in blue and cutoff model in orange. (c) Training curves for a random unbiased model realization, with full model in blue and cutoff model in orange. The shading corresponds to the spread over $30$ training instances. For these plots, $N=1$, $\Omega=\{1,...,8\}$ ($d=17$), $\tilde{\Omega}=\{1\}$ ($\tilde{d}=3$), $M=7$, $R=6$, $\mathfrak{n}_{\mathrm{train}}=25$ with a batch size of $5$.
  • Figure 5: (a) $\Delta_{\mathrm{f-c}}\mathrm{MSE}_{\mathrm{min}}$ for different values of $\hat{d}_{\mathrm{eff}}^{\mathrm{full}}-\hat{d}_{\mathrm{eff}}^{\mathrm{cut}}$, for different values of $\delta_{\mathrm{data}}$ (color scale). Each point corresponds to $\Delta_{\mathrm{f-c}}\mathrm{MSE}_{\mathrm{min}}$ averaged over $30$ training instances for $30$ random model realizations. (b) Same as panel (a) but resolved as a function of $\delta_{\mathrm{data}}$. The red line serves as a guide for the eye for zero $\mathrm{MSE}$ difference. For these plots, $N=1$, $\Omega=\{1,...,8\}$ ($d=17$), $\tilde{\Omega}=\{1\}$ ($\tilde{d}=3$), $M=7$, $R=6$, $\mathfrak{n}_{\mathrm{train}}=25$ with a batch size of $5$.
  • ...and 16 more figures
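
The captions of Figures 1(c), 2 and 3(b) describe a model of the form $f_{\boldsymbol{\theta}}(\boldsymbol{x})=\sum_{\mu\nu}e_{\mu}(\boldsymbol{x})\,\Gamma_{\mu\nu}\,\iota_{\nu}(\boldsymbol{\theta})$ whose ED is controlled by the singular values of $\Gamma$. The following is a minimal NumPy sketch of that relationship, not the paper's implementation. It rests on several assumptions that are not stated in the text above: the input basis is taken to be orthonormal under the input distribution, so that the Gaussian-regression FIM reduces to $(\Gamma J)^{T}(\Gamma J)$ with $J_{\nu m}=\partial\iota_{\nu}/\partial\theta_m$; the noise variance is set to one; the normalized ED uses the definition with parameters $\gamma$ and $n$ that is common in the QNN literature, which may differ from the paper's Eq. \ref{eq:norm_eff_dim}; the purity assumes the normalization $\mathrm{tr}(S^2)=1$; and a hard rank cutoff stands in for the paper's cutoff model. All function names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def param_jacobian(theta):
    """J[nu, m] = d iota_nu / d theta_m for the product basis
    iota(theta) = kron_m [1, sqrt2*cos(theta_m), sqrt2*sin(theta_m)]."""
    M = len(theta)
    local = [np.array([1.0, SQRT2 * np.cos(t), SQRT2 * np.sin(t)]) for t in theta]
    dlocal = [np.array([0.0, -SQRT2 * np.sin(t), SQRT2 * np.cos(t)]) for t in theta]
    cols = []
    for m in range(M):
        col = np.ones(1)
        for k in range(M):
            # Differentiate only the m-th factor of the tensor product.
            col = np.kron(col, dlocal[k] if k == m else local[k])
        cols.append(col)
    return np.stack(cols, axis=1)  # shape (3**M, M)

def fisher(gamma, theta):
    """FIM E_x[grad f grad f^T], ASSUMING E_x[e(x) e(x)^T] = I and unit noise."""
    gj = gamma @ param_jacobian(theta)
    return gj.T @ gj

def normalized_ed(gamma, thetas, n=1000, gam=1.0):
    """Monte Carlo estimate of the ED divided by the number of parameters M
    (ASSUMED definition; gam and n are the usual free parameters)."""
    M = thetas.shape[1]
    fims = np.array([fisher(gamma, th) for th in thetas])
    # Normalize so that the sample-averaged trace equals M.
    fims *= M / np.mean(np.trace(fims, axis1=1, axis2=2))
    kappa = gam * n / (2.0 * np.pi * np.log(n))
    _, logdets = np.linalg.slogdet(np.eye(M) + kappa * fims)
    half = 0.5 * logdets
    # Numerically stable log of the sample mean of sqrt(det(...)).
    log_mean = half.max() + np.log(np.mean(np.exp(half - half.max())))
    return 2.0 * log_mean / np.log(kappa) / M

def truncate_spectrum(gamma, rank):
    """Hypothetical 'cutoff' model: keep only the leading singular values."""
    u, s, vt = np.linalg.svd(gamma, full_matrices=False)
    s_cut = np.where(np.arange(len(s)) < rank, s, 0.0)
    return (u * s_cut) @ vt

def purity(gamma):
    """tr(S^4) after rescaling the singular values so that tr(S^2) = 1."""
    s = np.linalg.svd(gamma, compute_uv=False)
    p = s**2 / np.sum(s**2)
    return float(np.sum(p**2))

rng = np.random.default_rng(0)
M, D = 3, 5                                   # small illustrative sizes
gamma_full = rng.uniform(-1.0, 1.0, size=(D, 3**M))
gamma_cut = truncate_spectrum(gamma_full, rank=1)
thetas = rng.uniform(0.0, 2.0 * np.pi, size=(150, M))

ed_full = normalized_ed(gamma_full, thetas)
ed_cut = normalized_ed(gamma_cut, thetas)
print("purity full/cut:", purity(gamma_full), purity(gamma_cut))
print("normalized ED full/cut:", ed_full, ed_cut)
```

In this toy setting the truncated model has the higher purity and the lower normalized ED, which is the qualitative trend described for Figure 2(a). The sizes `M` and `D` are chosen only to keep the $3^M$-dimensional parameter basis small and do not correspond to the settings quoted in the captions.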