Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay

Yuetian Luo; Anru R. Zhang

TL;DR

This work develops a unified Riemannian optimization framework for tensor-on-tensor regression with unknown Tucker rank. It introduces Riemannian Gradient Descent (RGD) and Riemannian Gauss-Newton (RGN) methods that recover a low Tucker-rank parameter even when the input rank is over-specified. The authors prove that RGD converges linearly and RGN quadratically to a statistically optimal estimate in both the correctly parameterized and over-parameterized settings, and show that the algorithms adapt to over-parameterization without any change to their implementation. They also establish a statistical-computational gap in scalar-on-tensor regression via a low-degree polynomial argument, showing that for tensors of order three or higher, moderate rank over-parameterization is essentially cost-free in sample size among computationally feasible estimators, unlike in the matrix case. The paper further provides spectral initializations, specialized results for scalar-on-tensor and tensor-on-vector regression, and numerical experiments that corroborate the theory and show advantages over existing methods.

Abstract

We study tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates through a low Tucker rank parameter tensor/matrix, without prior knowledge of its intrinsic rank. We propose Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for general tensor-on-tensor regression by showing that RGD and RGN converge linearly and quadratically, respectively, to a statistically optimal estimate in both the rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also prove the statistical-computational gap in scalar-on-tensor regression by a direct low-degree polynomial argument. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix setting. This shows that moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
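To make the RGD iteration concrete, here is a minimal NumPy sketch for the simplest special case the paper lists, matrix trace regression (an order-2 parameter). It is an illustration under stated assumptions, not the authors' implementation: the least-squares loss, the exact line-search step size, the truncated-SVD retraction, and the function names `truncated_svd` and `rgd_matrix_trace_regression` are all choices made for this example. The input rank `r` may exceed the true rank, which is the over-parameterized regime described above.

```python
import numpy as np


def truncated_svd(M, r):
    """Return the leading r singular triplets of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r]


def rgd_matrix_trace_regression(A, y, r, n_iter=50):
    """Riemannian gradient descent sketch for y_i = <A_i, X*> (+ noise).

    A: array of shape (n, p1, p2) holding the measurement matrices A_i.
    y: array of shape (n,) holding the responses.
    r: input rank (hypothetical parameter name); may over-specify the true rank.
    """
    n = A.shape[0]
    # Spectral initialization: rank-r truncation of A^*(y) / n.
    U, s, Vt = truncated_svd(np.einsum("i,ijk->jk", y, A) / n, r)
    X = (U * s) @ Vt
    for _ in range(n_iter):
        residual = np.einsum("ijk,jk->i", A, X) - y   # A(X) - y
        G = np.einsum("i,ijk->jk", residual, A)       # A^*(A(X) - y)
        V = Vt.T
        # Project G onto the tangent space at X = U diag(s) V^T.
        UtG = U.T @ G
        GV = G @ V
        xi = U @ UtG + GV @ V.T - U @ (UtG @ V) @ V.T
        # Exact line search along -xi for the least-squares loss (assumed choice).
        A_xi = np.einsum("ijk,jk->i", A, xi)
        denom = A_xi @ A_xi
        if denom < 1e-30:
            break  # gradient is numerically zero; nothing left to do
        alpha = np.sum(xi * xi) / denom
        # Retraction: truncate the updated point back to rank r.
        U, s, Vt = truncated_svd(X - alpha * xi, r)
        X = (U * s) @ Vt
    return X
```

Calling the function with `r` larger than the true rank needs no other change to the code, which mirrors the adaptivity claim in the abstract; the sketch itself proves nothing about convergence rates.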

Paper Structure

This paper contains 56 sections, 32 theorems, 155 equations, 9 figures, 1 table, 7 algorithms.

Introduction
Central Questions
Our Contributions
Related Prior Work
Organization of the Paper
Notation and Preliminaries
Riemannian Optimization for Tensor-on-Tensor Regression
Geometry of Low Tucker Rank Tensor Manifolds
Riemannian Gradient Descent and Gauss-Newton for Tensor-on-Tensor Regression
Theory of RGD/RGN in Tensor-on-Tensor Regression
Applications, Initialization, and Guarantees in Specific Scenarios
Scalar-on-tensor Regression
Tensor-on-vector Regression
Matrix Trace Regression
Rank-$1$ Tensor-on-tensor Regression
...and 41 more sections

Key Result

Lemma 1

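For the objective $f({\mathbfcal{X}})$ in (eq:minimization), the Riemannian gradient is ${\rm grad}\, f({\mathbfcal{X}}) = P_{T_{\mathbfcal{X}}}(\mathscr{A}^*(\mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}})),$ where $P_{T_{\mathbfcal{X}}}$ is the orthogonal projection onto the tangent space at ${\mathbfcal{X}}$ and $\mathscr{A}^*$ is the adjoint operator of $\mathscr{A}$.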
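The formula can be checked in a few lines under two assumptions that this excerpt does not state explicitly: that the objective referenced as eq:minimization is the least-squares loss, and that the manifold carries the metric induced by the Euclidean inner product. The symbol $\xi$ for a tangent vector is introduced here only for illustration.

```latex
\begin{align*}
f({\mathbfcal{X}}) &= \tfrac{1}{2}\,\|\mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}}\|_{\rm F}^2
  && \text{(assumed form of the objective)} \\
\mathrm{D} f({\mathbfcal{X}})[\xi]
  &= \langle \mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}},\ \mathscr{A}(\xi) \rangle
   = \langle \mathscr{A}^*(\mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}}),\ \xi \rangle \\
  &= \langle P_{T_{\mathbfcal{X}}}(\mathscr{A}^*(\mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}})),\ \xi \rangle
  && \text{for all } \xi \text{ in the tangent space at } {\mathbfcal{X}}
\end{align*}
```

The last step uses that the orthogonal projection $P_{T_{\mathbfcal{X}}}$ is self-adjoint and leaves tangent vectors unchanged. The Riemannian gradient is the unique tangent vector that represents the directional derivative on the tangent space, so it equals the projected Euclidean gradient, in agreement with the lemma.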

Figures (9)

Figure 1: Comparison of sample size requirements in over-parameterized matrix trace regression (Panel (a)) and scalar-on-tensor regression (Panel (b)) under Gaussian ensemble design. Here the red line denotes the sample size ($n$) required for RGD and RGN to succeed with input rank $r$ and spectral initialization, and the black line ($n_{\text{comp}}$) denotes the sample complexity of the computational limit, i.e., the minimum sample size required by any efficient algorithm. For simplicity, we assume $p_1 = \ldots = p_d = p$, $r_1 = \ldots = r_d = r$, $r^*_1 = \ldots = r^*_d = r^*$, $d$ and $r^*$ are fixed constants, ${\mathbfcal{E}} = 0$ and ${\mathbfcal{X}}^*$ is well-conditioned.
Figure 2: Pictorial illustration of steps in Riemannian optimization
Figure 3: Convergence performance of RGD/RGN in over-parameterized scalar-on-tensor regression with spectral initialization. Here, $p = 30, r^* = 3, r= 10$.
Figure 4: Convergence performance of RGD/RGN in over-parameterized tensor-on-vector regression with spectral initialization. Here $p = 30, r^* = 3, r= 10$.
Figure 5: Convergence performance of RGD/RGN in over-parameterized scalar-on-tensor regression with spectral initialization. Here $p = 30, r^* = 3,n \in [500,8000], r\in \{3,6,9,12,15 \}$.
...and 4 more figures

Theorems & Definitions (44)

Lemma 1: Riemannian gradient
Lemma 2
Remark 1: Riemannian Optimization for Bounded Rank Constraint
Definition 1: Tensor Restricted Isometry Property (TRIP)
Proposition 1: TRIP Under sub-Gaussian
Theorem 1: Convergence of RGD
Theorem 2: Convergence of RGN
Remark 2: General Input Rank and Under-parameterization
Remark 3
Remark 4: Conditions
...and 34 more
