Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent

Santhosh Karnik, Anna Veselovska, Mark Iwen, Felix Krahmer

TL;DR

It is established that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank, the first tensor result of its kind for gradient descent rather than gradient flow.

Abstract

We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, motivated by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.
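
To make the tubal tensor product and the tubal rank mentioned above concrete, here is a minimal NumPy sketch. It assumes the standard FFT-based formulation of the tubal product (circular convolution along the third mode, computed as slice-wise matrix products in the Fourier domain). The function names `t_product`, `t_transpose`, and `tubal_rank`, as well as the tolerance `tol`, are illustrative choices and do not come from the paper.

```python
import numpy as np

def t_product(A, B):
    """Tubal product of A (n x p x k) and B (p x m x k), real inputs assumed.

    Tubes are multiplied by circular convolution along the third mode,
    which becomes a slice-wise matrix product after an FFT along that mode.
    """
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.einsum("ipk,pjk->ijk", A_hat, B_hat)
    return np.real(np.fft.ifft(C_hat, axis=2))

def t_transpose(A):
    """Transpose every frontal slice and reverse the order of slices 2..k."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def tubal_rank(T, tol=1e-8):
    """Number of non-zero singular tubes of T (tol is a relative threshold)."""
    T_hat = np.fft.fft(T, axis=2)
    # Singular values of each Fourier-domain frontal slice, shape (k, min(n, m)).
    s = np.linalg.svd(np.moveaxis(T_hat, 2, 0), compute_uv=False)
    # The i-th singular tube is non-zero if it is non-zero in any Fourier slice.
    return int(np.sum(s.max(axis=0) > tol * s.max()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 3, 4))      # n = 10, r = 3, k = 4
    T = t_product(X, t_transpose(X))         # tubal positive semidefinite tensor
    print(T.shape, tubal_rank(T))
```

For a generic factor of size $n \times r \times k$ with $r \le n$, the product $\bm{\mathcal{X}} * \bm{\mathcal{X}}^{\top}$ built this way has tubal rank $r$, which is the low-rank structure the analysis targets.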

Paper Structure

This paper contains 25 sections, 34 theorems, 352 equations, 4 figures.

Key Result

Theorem 3.1

Suppose we have $m$ linear measurements $y = \mathcal{A}(\bm{\mathcal{X}} * \bm{\mathcal{X}}^{\top})$ of a tubal positive semidefinite tensor $\bm{\mathcal{X}} * \bm{\mathcal{X}}^{\top} \in S^{n \times n\times k}_{+}$ where $\bm{\mathcal{X}} \in \mathbb{R}^{n \times r \times k}$ has tubal rank $r$ … starting from the initialization $\bm{\mathcal{U}}_0 \in \mathbb{R}^{n \times R \times k}$ where ea…

Figures (4)

  • Figure 1: A low tubal-rank factorization of a three-dimensional tensor. Using the (reduced) tubal-SVD, each three-dimensional tensor $\bm{\mathcal{T}}\in \mathbb{R}^{n\times m \times k}$ can be decomposed into a tubal product of three tensors $\bm{\mathcal{T}}= \bm{\mathcal{V}}*\bm{\Sigma}*\bm{\mathcal{W}}^\top$ with $\bm{\mathcal{V}}\in \mathbb{R}^{n\times n \times k}$, $\bm{\mathcal{W}}\in \mathbb{R}^{m\times m \times k}$ and the frontal-slice-diagonal tensor $\bm{\Sigma} \in \mathbb{R}^{n\times m \times k}$. The tubal rank of a tensor is the number of non-zero singular tubes in $\bm{\Sigma}$. For example, the tensor shown in the figure has tubal rank six.
  • Figure 2: Illustration of (a) the two stages of the gradient descent algorithm, namely the spectral alignment stage for $1\le t \lesssim 3000$ and the convergence stage for $t \gtrsim 3000$, and (b) the alignment phase of gradient descent in more detail. For the ground truth tensor $\bm{\mathcal{X}}\in \mathbb{R}^{n\times r \times k}$, we set $n = 10$, $k = 4$, $r = 3$.
  • Figure 3: Outcomes of employing gradient descent to minimize the loss function \ref{eq:loss} with different overparametrization rates. We set $n = 10$, $k = 4$, $r = 3$ for the ground truth tensor $\bm{\mathcal{X}}\in \mathbb{R}^{n\times r \times k}$, and for the initialization $\bm{\mathcal{U}}_0\in \mathbb{R}^{n\times R \times k}$ we set the over-rank to $R = 10, 50, 100, 200, 400$. For each $R$ we plot the average over twenty experiments. The plots (a), (b), and (d) are semi-log plots.
  • Figure 4: Impact of different initialization scales on the test and training errors, shown in a semi-log plot. We set $n = 10$, $k = 4$, $r = 3$ for the ground truth tensor $\bm{\mathcal{X}}\in \mathbb{R}^{n\times r \times k}$, and for the initialization $\bm{\mathcal{U}}_0=\alpha \,\bm{\mathcal{U}}\in \mathbb{R}^{n\times R \times k}$ we use $R=200$ and different values of the scale $\alpha$. The plot shows the average over five runs, and the bars represent the deviations from the mean.
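
The experiments described in the captions above run gradient descent on an overparametrized factor $\bm{\mathcal{U}} \in \mathbb{R}^{n\times R \times k}$ starting from a small random initialization. The following NumPy sketch shows what such a loop might look like. Several ingredients are assumptions made for illustration and are not taken from the paper: the measurement operator is modeled as inner products with Gaussian tensors `A[i]` scaled by $1/\sqrt{m}$, the loss is the plain least-squares objective $\tfrac12\|\mathcal{A}(\bm{\mathcal{U}} * \bm{\mathcal{U}}^{\top}) - y\|_2^2$, and the initialization is i.i.d. Gaussian scaled by `alpha`. The names `loss_and_grad`, `gradient_descent`, `mu`, and `steps` are hypothetical.

```python
import numpy as np

def t_product(A, B):
    """Tubal product via slice-wise matrix products in the Fourier domain."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    return np.real(np.fft.ifft(np.einsum("ipk,pjk->ijk", A_hat, B_hat), axis=2))

def t_transpose(A):
    """Transpose every frontal slice and reverse the order of slices 2..k."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def loss_and_grad(U, A, y):
    """Least-squares loss and its gradient with respect to U.

    A has shape (m, n, n, k), y has shape (m,).
    Assumed model: y_i = <A_i, X * X^T> / sqrt(m).
    """
    m = A.shape[0]
    UUt = t_product(U, t_transpose(U))
    residual = np.einsum("mijk,ijk->m", A, UUt) / np.sqrt(m) - y
    # Adjoint of the measurement operator applied to the residual.
    G = np.einsum("m,mijk->ijk", residual, A) / np.sqrt(m)
    # Chain rule through U * U^T gives (G + G^T) * U.
    grad = t_product(G + t_transpose(G), U)
    return 0.5 * np.sum(residual ** 2), grad

def gradient_descent(A, y, R, alpha=1e-3, mu=0.005, steps=1000, seed=0):
    """Plain gradient descent from a small Gaussian initialization of scale alpha."""
    rng = np.random.default_rng(seed)
    n, k = A.shape[1], A.shape[3]
    U = alpha * rng.standard_normal((n, R, k))   # assumed form of the small init
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(U, A, y)
        losses.append(loss)
        U = U - mu * grad
    return U, losses

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, r, k, R, m = 10, 3, 4, 50, 1000           # sizes chosen for illustration only
    X = rng.standard_normal((n, r, k)) / np.sqrt(n * k)
    A = rng.standard_normal((m, n, n, k))
    y = np.einsum("mijk,ijk->m", A, t_product(X, t_transpose(X))) / np.sqrt(m)
    U, losses = gradient_descent(A, y, R)
    print(losses[0], losses[-1])
```

The step size, iteration count, and number of measurements here are not tuned to match the figures. Varying `alpha` and `R` in this sketch is one way to explore, on a small scale, the dependence on initialization scale and overparametrization that the captions describe.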

Theorems & Definitions (63)

  • Theorem 3.1
  • Lemma C.1
  • proof
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Lemma D.3
  • proof
  • Lemma D.4
  • ...and 53 more