The power of small initialization in noisy low-tubal-rank tensor recovery

Zhiyu Liu; Haobo Geng; Xudong Wang; Yandong Tang; Zhi Han; Yao Wang

Abstract

We study the problem of recovering a low-tubal-rank tensor $\mathcal{X}_\star\in \mathbb{R}^{n \times n \times k}$ from noisy linear measurements under the t-product framework. A widely adopted strategy factorizes the optimization variable as $\mathcal{U} * \mathcal{U}^\top$, where $\mathcal{U} \in \mathbb{R}^{n \times R \times k}$, and then applies factorized gradient descent (FGD) to the resulting optimization problem. Since the tubal-rank $r$ of the underlying tensor $\mathcal{X}_\star$ is typically unknown, this method often assumes $r < R \le n$, a regime known as over-parameterization. However, when the measurements are corrupted by dense noise (e.g., Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the overestimated tubal-rank $R$. To address this issue, we show that a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank $R$ is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank $R$. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All of these theoretical findings are validated through a series of simulations and real-data experiments.
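The setup described in the abstract can be illustrated with a minimal NumPy sketch of FGD on the factorization $\mathcal{U} * \mathcal{U}^\top$ with a small random initialization. It is an illustration under stated assumptions, not the authors' implementation: the loss is the training error $\frac{1}{4m}\|\bm{y}-\bm{\mathfrak{M}}(\mathcal{U}*\mathcal{U}^\top)\|^2$ quoted in the Figure 1 caption, the measurement tensors are assumed to be i.i.d. Gaussian, and the names `t_product`, `t_transpose`, `loss_and_grad`, `fgd_small_init`, and `make_problem`, as well as all default parameter values, are hypothetical.

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x k) and B (n2 x n3 x k) via FFT along mode 3."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)  # slice-wise matrix products
    return np.real(np.fft.ifft(Cf, axis=2))

def t_transpose(A):
    """Transpose every frontal slice, then reverse the order of slices 2..k."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def loss_and_grad(U, A, y):
    """f(U) = 1/(4m) * ||y - M(U * U^T)||^2 and its gradient in U.

    A has shape (m, n, n, k); measurement i is the inner product <A[i], X>.
    """
    m = y.shape[0]
    X = t_product(U, t_transpose(U))
    res = np.tensordot(A, X, axes=([1, 2, 3], [0, 1, 2])) - y  # residuals, shape (m,)
    G = np.tensordot(res, A, axes=([0], [0])) / m              # adjoint of M applied to res
    grad = 0.5 * t_product(G + t_transpose(G), U)
    return np.sum(res ** 2) / (4 * m), grad

def fgd_small_init(A, y, R, alpha=1e-3, eta=0.1, T=2000, seed=0):
    """Plain gradient descent on U, started from a random point of scale alpha."""
    n, k = A.shape[1], A.shape[3]
    rng = np.random.default_rng(seed)
    U = alpha * rng.standard_normal((n, R, k))  # small initialization
    for _ in range(T):
        _, grad = loss_and_grad(U, A, y)
        U = U - eta * grad
    return U

def make_problem(n=8, k=2, r=1, m=600, sigma=1e-3, seed=1):
    """Synthetic instance (assumed Gaussian measurements and Gaussian noise)."""
    rng = np.random.default_rng(seed)
    U_star = rng.standard_normal((n, r, k))
    X_star = t_product(U_star, t_transpose(U_star))
    Xf = np.fft.fft(X_star, axis=2)
    # Normalize so the largest Fourier-domain singular value is 1 (keeps eta = 0.1 stable).
    X_star /= max(np.linalg.norm(Xf[:, :, j], 2) for j in range(k))
    A = rng.standard_normal((m, n, n, k))
    y = np.tensordot(A, X_star, axes=([1, 2, 3], [0, 1, 2])) + sigma * rng.standard_normal(m)
    return A, y, X_star

A, y, X_star = make_problem()
U = fgd_small_init(A, y, R=3)  # over-parameterized: R = 3 > r = 1
X_hat = t_product(U, t_transpose(U))
rel_err = np.sum((X_hat - X_star) ** 2) / np.sum(X_star ** 2)
```

The gradient uses the symmetrized adjoint `G + t_transpose(G)` because the Gaussian measurement tensors in this sketch are not symmetric; with symmetric measurements the expression reduces to `t_product(G, U)`.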

Paper Structure

This paper contains 39 sections, 22 theorems, 190 equations, 15 figures, 7 tables, 3 algorithms.

Introduction
Related works
Preliminaries
Main results
Factorized gradient descent and t-RIP
Theoretical guarantees
Proof sketch
Early stopping via validation
Experiments
Conclusion
Organization of Appendix
Use of Large Language Models
Reproducibility Statement
Additional Preliminaries
Proof of Theorem \ref{theorem:main}
...and 24 more sections

Key Result

Theorem 1

Let $\bm{\mathcal{Y}}\in\mathbb{R}^{m\times n \times k}$, then it can be factored as $\bm{\mathcal{Y}}=\bm{\mathcal{V}}_{\bm{\mathcal{Y}}} * \bm{\mathcal{S}}_{\bm{\mathcal{Y}}} * \bm{\mathcal{W}}_{\bm{\mathcal{Y}}}^\top$ where $\bm{\mathcal{V}}_{\bm{\mathcal{Y}}}\in\mathbb{R}^{m \times m \times k}$,
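The factorization in Theorem 1 can be computed slice by slice in the Fourier domain. The sketch below follows the standard t-SVD construction, in which $\bm{\mathcal{V}}$ ($m \times m \times k$) and $\bm{\mathcal{W}}$ ($n \times n \times k$) are orthogonal tensors and $\bm{\mathcal{S}}$ ($m \times n \times k$) is f-diagonal. The function names `t_svd` and `tubal_rank` and the tolerance `tol` are illustrative choices, not taken from the paper.

```python
import numpy as np

def t_product(A, B):
    """t-product via FFT along the third mode."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    return np.real(np.fft.ifft(np.einsum('ijk,jlk->ilk', Af, Bf), axis=2))

def t_transpose(A):
    """Transpose every frontal slice, then reverse the order of slices 2..k."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def t_svd(Y):
    """Return real tensors V, S, W with Y = V * S * W^T (t-products)."""
    m, n, k = Y.shape
    p = min(m, n)
    Yf = np.fft.fft(Y, axis=2)
    Vf = np.zeros((m, m, k), dtype=complex)
    Sf = np.zeros((m, n, k), dtype=complex)
    Wf = np.zeros((n, n, k), dtype=complex)
    for j in range(k // 2 + 1):
        slice_j = Yf[:, :, j]
        if j == 0 or 2 * j == k:
            slice_j = slice_j.real  # these Fourier slices are real for real Y
        u, s, vh = np.linalg.svd(slice_j)  # full SVD of one Fourier slice
        Vf[:, :, j] = u
        Sf[np.arange(p), np.arange(p), j] = s
        Wf[:, :, j] = vh.conj().T
    # Fill the remaining slices by conjugate symmetry so the inverse FFT is real.
    for j in range(k // 2 + 1, k):
        Vf[:, :, j] = Vf[:, :, k - j].conj()
        Sf[:, :, j] = Sf[:, :, k - j]
        Wf[:, :, j] = Wf[:, :, k - j].conj()
    to_real = lambda Z: np.real(np.fft.ifft(Z, axis=2))
    return to_real(Vf), to_real(Sf), to_real(Wf)

def tubal_rank(S, tol=1e-8):
    """Number of nonzero singular tubes S[i, i, :]."""
    p = min(S.shape[0], S.shape[1])
    return sum(np.linalg.norm(S[i, i, :]) > tol for i in range(p))

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 3, 5))
V, S, W = t_svd(Y)
Y_rec = t_product(t_product(V, S), t_transpose(W))
```

Computing the SVD only for the first half of the Fourier slices and mirroring the rest is what keeps the three factors real-valued; taking an independent SVD of every slice would generally break the conjugate symmetry.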

Figures (15)

Figure 1: Comparison of training and testing errors for Problem (\ref{equ:3}) using FGD with spectral vs. small initialization. The ground-truth tensor has tubal-rank $r=2$, overestimated rank $R=4$, size $n=20$, $k=3$, $m=5kr(2n-r)$ measurements, and noise $\sigma=10^{-3}$. Spectral initialization follows liu2024low, while small initialization uses a near-zero starting point. Training error is $\frac{1}{4m}\|\bm{y}-\bm{\mathfrak{M}}(\bm{\mathcal{U}}*\bm{\mathcal{U}}^\top)\|^2$, and testing error is $\|\bm{\mathcal{U}}*\bm{\mathcal{U}}^\top - \bm{\mathcal{X}}_\star\|_F^2/\|\bm{\mathcal{X}}_\star\|_F^2$. “Baseline” denotes recovery under exact rank $R=r$. Insets show early (first 500 iterations) vs. full error curves.
Figure 2: Performance comparison under varying $r$, $\sigma$, $n$, and $m$. Subfigure (a) illustrates the recovery error of all methods under different over-rank values $R$, with parameters set as $m = 10nrk$, $n = 30$, $\sigma = 10^{-3}$, $\eta = 0.1$, and $T = 5000$. Subfigure (b) illustrates the error under varying noise levels $\sigma$, with $m = 10nrk$, $n = 30$, $R = 3r$, $\eta = 0.1$, and $T = 5000$. Subfigure (c) illustrates the error as the problem dimension $n$ changes, where $m = 10nrk$, $R = 3r$, $\eta = 0.1$, $T = 20000$, and $\sigma = 10^{-3}$. Subfigure (d) illustrates the performance under different numbers of measurements $C_m$, with $m = 2C_m nrk$, $n = 30$, $R = 3r$, $\eta = 0.01$, $T = 20000$, and $\sigma = 10^{-3}$.
Figure 3: Validation of the algorithm with $m = 10nrk$, $R = 3r$, $n = 30$, $\sigma = 10^{-3}$, $\eta = 0.1$. (a) Validation loss vs. RSE, with the blue dot marking the minimum. (b) Error of the validation-based method compared with the minimum errors of baseline and small-initialization under varying $m_{\text{train}}$.
Figure 4: Validation of the sensitivity of FGD to different tubal-ranks.
Figure 5: Validation of the four-phase convergence analysis in Section 3.3. The left panel shows the first 1,000 iterations; the right panel shows the full 10,000 iterations. The orange curve corresponds to the orange axis on the right, and the blue curve corresponds to the blue axis on the left. Parameter settings: $n=10$, $k=3$, $r=2$, $R=10$, $m=5knR$, $\eta=0.1$, noise standard deviation $\sigma=0.01$, and initialization scale $\alpha=10^{-7}$.
...and 10 more figures
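The validation-based early stopping referred to in the Figure 3 caption can be sketched as follows: hold out part of the measurements, run FGD on the rest, and keep the iterate with the smallest held-out loss. This is a hedged illustration, not the paper's algorithm. For brevity it uses the special case $k=1$, where the t-product reduces to ordinary matrix multiplication; the train/validation split, the Gaussian measurements, the relative-error definition, and every name and default value (`sensing_loss`, `fgd_early_stop`, `alpha`, `eta`, `T`) are assumptions.

```python
import numpy as np

def sensing_loss(U, A, y):
    """1/(4m) * ||y - M(U U^T)||^2 for measurement matrices A of shape (m, n, n)."""
    res = np.einsum('mij,ij->m', A, U @ U.T) - y
    return np.sum(res ** 2) / (4 * len(y)), res

def fgd_early_stop(A_tr, y_tr, A_val, y_val, R, alpha=1e-3, eta=0.1, T=1000, seed=0):
    """FGD from a small random start; returns the iterate with the lowest validation loss."""
    n = A_tr.shape[1]
    rng = np.random.default_rng(seed)
    U = alpha * rng.standard_normal((n, R))
    best_U, best_val, best_t, history = U.copy(), np.inf, 0, []
    for t in range(T + 1):
        val, _ = sensing_loss(U, A_val, y_val)  # held-out loss of the current iterate
        history.append(val)
        if val < best_val:
            best_U, best_val, best_t = U.copy(), val, t
        _, res = sensing_loss(U, A_tr, y_tr)
        G = np.einsum('m,mij->ij', res, A_tr) / len(y_tr)
        U = U - eta * 0.5 * (G + G.T) @ U       # gradient step on the training loss
    return best_U, best_t, history

# Synthetic instance: rank r = 2 ground truth, over-parameterized with R = 5.
rng = np.random.default_rng(1)
n, r, R, m = 10, 2, 5, 500
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T
X_star /= np.linalg.norm(X_star, 2)             # unit spectral norm keeps eta = 0.1 stable
A = rng.standard_normal((m, n, n))
y = np.einsum('mij,ij->m', A, X_star) + 1e-2 * rng.standard_normal(m)

best_U, best_t, history = fgd_early_stop(A[:400], y[:400], A[400:], y[400:], R)
rel_err = np.linalg.norm(best_U @ best_U.T - X_star) / np.linalg.norm(X_star)
```

Only the validation measurements decide which iterate is returned, so the selection never touches the ground truth `X_star`; `rel_err` is computed afterwards purely for inspection.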

Theorems & Definitions (54)

Theorem 1: t-SVD kilmer2011factorization
Definition 1: t-RIP zhang2021tensor
Theorem 2
Remark 1
Theorem 3: Minimax error
Corollary 1
Remark 2
Remark 3
Remark 4
Remark 5
...and 44 more
