Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay

Yuetian Luo, Anru R. Zhang

TL;DR

This work develops a unified Riemannian optimization framework for tensor-on-tensor regression with unknown Tucker rank, introducing Riemannian Gradient Descent (RGD) and Riemannian Gauss-Newton (RGN) to recover a low Tucker-rank parameter under rank over-parameterization. The authors prove that RGD converges linearly and RGN converges quadratically to a statistically optimal estimate, both when the rank is correctly specified and when it is over-specified, and show that the algorithms adapt to over-parameterization without any modification to their implementation. They establish a statistical-computational gap for scalar-on-tensor regression via a direct low-degree polynomial argument, showing that for tensors of order three or higher, moderate rank over-parameterization is essentially cost-free in sample size among computationally feasible estimators, unlike in the matrix case. The paper also provides spectral initializations, specialized results for scalar-on-tensor and tensor-on-vector regression, and numerical experiments that corroborate the theory and show advantages over existing methods.

Abstract

We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also prove the statistical-computational gap in scalar-on-tensor regression by a direct low-degree polynomial argument. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.

Paper Structure

This paper contains 56 sections, 32 theorems, 155 equations, 9 figures, 1 table, 7 algorithms.

Key Result

Lemma 1

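For the objective $f({\mathbfcal{X}})$ in the minimization problem (eq:minimization), the Riemannian gradient is ${\rm grad}\, f({\mathbfcal{X}}) = P_{T_{\mathbfcal{X}}}(\mathscr{A}^*(\mathscr{A}({\mathbfcal{X}}) - {\mathbfcal{Y}}))$, where $\mathscr{A}^*$ is the adjoint operator of $\mathscr{A}$ and $P_{T_{\mathbfcal{X}}}$ is the projection onto the tangent space at ${\mathbfcal{X}}$.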
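
As a rough illustration of Lemma 1, the sketch below specializes the formula to the order-2 case (matrix trace regression with scalar responses). It makes several assumptions that are not stated in this excerpt: the objective is taken to be the least-squares loss $f(X) = \frac{1}{2}\|\mathscr{A}(X) - y\|_2^2$, which is consistent with the gradient formula; the measurement operator is $[\mathscr{A}(X)]_i = \langle A_i, X \rangle$; and the tangent-space projector is the standard one for the fixed-rank matrix manifold, not the paper's Tucker construction. The function names `forward`, `adjoint`, `project_tangent`, and `riemannian_grad` are hypothetical and chosen only for this example.

```python
import numpy as np

def forward(A, X):
    """Apply the measurement operator: A has shape (n, p1, p2), X has shape (p1, p2).
    Returns the vector of inner products <A_i, X>, shape (n,)."""
    return np.tensordot(A, X, axes=([1, 2], [0, 1]))

def adjoint(A, z):
    """Apply the adjoint operator: maps z of shape (n,) to sum_i z_i * A_i, shape (p1, p2)."""
    return np.tensordot(z, A, axes=(0, 0))

def project_tangent(U, V, Z):
    """Project Z onto the tangent space of the fixed-rank matrix manifold at X = U S V^T.
    U (p1, r) and V (p2, r) must have orthonormal columns.
    Formula: P_T(Z) = U U^T Z + Z V V^T - U U^T Z V V^T."""
    UtZ = U.T @ Z
    ZV = Z @ V
    return U @ UtZ + ZV @ V.T - U @ (UtZ @ V) @ V.T

def riemannian_grad(A, y, X, r):
    """Riemannian gradient of the assumed loss 0.5 * ||A(X) - y||^2 at a rank-r matrix X."""
    # Top-r singular subspaces of X define the tangent space.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    # Euclidean gradient A^*(A(X) - y), then projection as in Lemma 1.
    euclid_grad = adjoint(A, forward(A, X) - y)
    return project_tangent(U, V, euclid_grad)
```

In this matrix setting, an RGD step would move from $X$ along the negative of this gradient and then retract back to the rank-$r$ set, for example with a truncated SVD; the step size and retraction used in the paper are not given in this excerpt.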

Figures (9)

  • Figure 1: Comparison of sample size requirements in over-parameterized matrix trace regression (Panel (a)) and scalar-on-tensor regression (Panel (b)) under Gaussian ensemble design. The red line shows the sample size ($n$) required for RGD and RGN to succeed with input rank $r$ and spectral initialization; the black line ($n_{\text{comp}}$) shows the sample complexity of the computational limit, i.e., the minimum sample size required by any efficient algorithm. For simplicity, we assume $p_1 = \ldots = p_d = p$, $r_1 = \ldots = r_d = r$, $r^*_1 = \ldots = r^*_d = r^*$, that $d$ and $r^*$ are fixed constants, that ${\mathbfcal{E}} = 0$, and that ${\mathbfcal{X}}^*$ is well-conditioned.
  • Figure 2: Pictorial illustration of steps in Riemannian optimization
  • Figure 3: Convergence performance of RGD/RGN in over-parameterized scalar-on-tensor regression with spectral initialization. Here, $p = 30$, $r^* = 3$, $r = 10$.
  • Figure 4: Convergence performance of RGD/RGN in over-parameterized tensor-on-vector regression with spectral initialization. Here, $p = 30$, $r^* = 3$, $r = 10$.
  • Figure 5: Convergence performance of RGD/RGN in over-parameterized scalar-on-tensor regression with spectral initialization. Here, $p = 30$, $r^* = 3$, $n \in [500, 8000]$, $r \in \{3, 6, 9, 12, 15\}$.
  • ...and 4 more figures

Theorems & Definitions (44)

  • Lemma 1: Riemannian gradient
  • Lemma 2
  • Remark 1: Riemannian Optimization for Bounded Rank Constraint
  • Definition 1: Tensor Restricted Isometry Property (TRIP)
  • Proposition 1: TRIP Under sub-Gaussian
  • Theorem 1: Convergence of RGD
  • Theorem 2: Convergence of RGN
  • Remark 2: General Input Rank and Under-parameterization
  • Remark 3
  • Remark 4: Conditions
  • ...and 34 more