Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent

Daniel Busbib, Ami Wiesel

TL;DR

This work investigates estimating Toeplitz covariance matrices under a parametric Carathéodory decomposition using overparameterized gradient descent (GD). It models the $P \times P$ covariance as a sum of $K$ complex sinusoids with learnable amplitudes and frequencies and optimizes them with GD. With $K = P$, GD may converge to suboptimal solutions, whereas mild overparameterization (e.g., $K=2P$ or $4P$) consistently enables global convergence from random initializations. The authors introduce an accelerated variant with separate learning rates for amplitudes and frequencies, and prove that when frequencies are fixed the optimization landscape is asymptotically benign, with any stationary point recovering the true covariance. Empirical results demonstrate that overparameterized GD matches or surpasses state-of-the-art methods such as ATOM across structured, AR, and random Carathéodory data, while remaining simple and scalable.

Abstract

We consider covariance estimation under Toeplitz structure. Numerous sophisticated optimization methods have been developed to maximize the Gaussian log-likelihood under Toeplitz constraints. In contrast, recent advances in deep learning demonstrate the surprising power of simple gradient descent (GD) applied to overparameterized models. Motivated by this trend, we revisit Toeplitz covariance estimation through the lens of overparameterized GD. We model the $P\times P$ covariance as a sum of $K$ complex sinusoids with learnable parameters and optimize them via GD. We show that when $K = P$, GD may converge to suboptimal solutions. However, mild overparameterization ($K = 2P$ or $4P$) consistently enables global convergence from random initializations. We further propose an accelerated GD variant with separate learning rates for amplitudes and frequencies. When frequencies are fixed and only amplitudes are optimized, we prove that the optimization landscape is asymptotically benign and any stationary point recovers the true covariance. Finally, numerical experiments demonstrate that overparameterized GD can match or exceed the accuracy of state-of-the-art methods in challenging settings, while remaining simple and scalable.
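The parameterization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the model $\widehat{\bm{C}} = \sum_{k=1}^{K} a_k \bm{v}(\omega_k)\bm{v}(\omega_k)^H$ with $\bm{v}(\omega) = [1, e^{j\omega}, \dots, e^{j(P-1)\omega}]^T$ and the Gaussian negative log-likelihood $\log\det\widehat{\bm{C}} + \mathrm{tr}(\widehat{\bm{C}}^{-1}\bm{S})$ for a sample covariance $\bm{S}$. The function names, the trace-matched initialization, the amplitude floor, and the learning-rate values are all hypothetical choices made for the example.

```python
import numpy as np

def steering(omega, P):
    """P x K matrix whose k-th column is v(omega_k) = [1, e^{jw}, ..., e^{j(P-1)w}]^T."""
    p = np.arange(P)[:, None]
    return np.exp(1j * p * omega[None, :])

def covariance(a, omega, P):
    """Toeplitz covariance C = sum_k a_k v(omega_k) v(omega_k)^H."""
    V = steering(omega, P)
    return (V * a[None, :]) @ V.conj().T

def nll(a, omega, S):
    """Gaussian negative log-likelihood (up to constants): log det C + tr(C^{-1} S)."""
    C = covariance(a, omega, S.shape[0])
    _, logdet = np.linalg.slogdet(C)
    return float(logdet + np.trace(np.linalg.solve(C, S)).real)

def gradients(a, omega, S):
    """Analytic gradients of the NLL with respect to amplitudes and frequencies."""
    P = S.shape[0]
    V = steering(omega, P)
    Cinv = np.linalg.inv((V * a[None, :]) @ V.conj().T)
    G = Cinv - Cinv @ S @ Cinv                   # dNLL/dC (Hermitian)
    dV = 1j * np.arange(P)[:, None] * V          # derivative of each column w.r.t. omega_k
    grad_a = np.sum(V.conj() * (G @ V), axis=0).real             # v_k^H G v_k
    grad_w = 2.0 * a * np.sum(V.conj() * (G @ dV), axis=0).real  # 2 a_k Re(v_k^H G v_k')
    return grad_a, grad_w

def fit(S, K, lr_a=1e-3, lr_w=1e-4, steps=200, seed=0, floor=1e-8):
    """GD with separate learning rates for amplitudes and frequencies.

    Assumptions of this sketch (not taken from the paper): frequencies start
    uniformly at random, amplitudes start equal and trace-matched to S, and
    amplitudes are clipped to a small positive floor after each step.
    """
    rng = np.random.default_rng(seed)
    P = S.shape[0]
    omega = rng.uniform(-np.pi, np.pi, K)
    a = np.full(K, np.trace(S).real / (P * K))
    for _ in range(steps):
        g_a, g_w = gradients(a, omega, S)
        a = np.maximum(a - lr_a * g_a, floor)    # amplitude step, kept positive
        omega = omega - lr_w * g_w               # frequency step
    return a, omega
```

A call such as `a_hat, w_hat = fit(S, K=2 * S.shape[0])` corresponds to the mildly overparameterized setting $K = 2P$, and `covariance(a_hat, w_hat, S.shape[0])` then gives the estimate. The two step sizes are hand-picked placeholders that would need tuning for any real problem.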

Paper Structure

This paper contains 14 sections, 2 theorems, 55 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Assume that the complex sinusoids $\{\bm{v}(\omega_k)\}$ span $\mathbb{R}^P$, and $\bm{S} = \bm{C}$ is the true positive definite covariance matrix. If $\widehat{\bm{a}}$ is a stationary point of $\mathrm{NLL}(\bm{a})$ with $\widehat{\bm{C}}(\widehat{\bm{a}}) \succ 0$, then $\widehat{\bm{C}}(\wideha

Figures (6)

  • Figure 1: Validation of empirical Lipschitz approximation across 1000 Monte Carlo trials. Left: The amplitude bound derived from \ref{eq:La_approx} (y-axis) upper-bounds the empirical $L_a$ (x-axis). Right: The frequency constant derived from \ref{eq:Lw_approx} approximates $L_\omega$, spanning a wider range due to $P^2$ and $\|\hat{\bm{C}}^{-1}\|_2^{3/2}$ factors.
  • Figure 2: RMSE versus sample size for the ATOM benchmark setup. Overparameterized GD (GDx2, GDx4) converges near the CRB without prior knowledge and performs comparably to ATOM.
  • Figure 3: RMSE versus sample size for AR(3) covariance model. Overparameterized GD matches CRB performance with random initialization, while PGD achieves the lowest RMSE, falling below the CRB because it is biased.
  • Figure 4: RMSE versus sample size for random Carathéodory Toeplitz covariances. Overparameterized gradient descent (GDx2, GDx4) outperforms ATOM and achieves the CRB.
  • Figure 5: RMSE versus sample size $M$ for different overparameterization factors $K/P \in [1,2]$. Overparameterized models ($K \approx 2P$) achieve substantially lower RMSE, even for small $M$, whereas minimally parameterized configurations ($K \approx P$) remain unstable and sensitive to sample size.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2