Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent
Daniel Busbib, Ami Wiesel
TL;DR
This work investigates estimating Toeplitz covariance matrices under a parametric Carathéodory decomposition using overparameterized gradient descent (GD). The $P\times P$ covariance is modeled as a sum of $K$ complex sinusoids with learnable amplitudes and frequencies, which are optimized with GD. With $K = P$, GD may converge to suboptimal solutions, whereas mild overparameterization (e.g., $K=2P$ or $4P$) consistently enables global convergence from random initializations. The authors introduce an accelerated variant with separate learning rates for amplitudes and frequencies, and prove that when frequencies are fixed and only amplitudes are optimized, the optimization landscape is asymptotically benign, with any stationary point recovering the true covariance. Empirical results demonstrate that overparameterized GD matches or surpasses state-of-the-art methods such as ATOM across structured, AR, and random Carathéodory data, while remaining simple and scalable.
Abstract
We consider covariance estimation under Toeplitz structure. Numerous sophisticated optimization methods have been developed to maximize the Gaussian log-likelihood under Toeplitz constraints. In contrast, recent advances in deep learning demonstrate the surprising power of simple gradient descent (GD) applied to overparameterized models. Motivated by this trend, we revisit Toeplitz covariance estimation through the lens of overparameterized GD. We model the $P\times P$ covariance as a sum of $K$ complex sinusoids with learnable parameters and optimize them via GD. We show that when $K = P$, GD may converge to suboptimal solutions. However, mild overparameterization ($K = 2P$ or $4P$) consistently enables global convergence from random initializations. We further propose an accelerated GD variant with separate learning rates for amplitudes and frequencies. When frequencies are fixed and only amplitudes are optimized, we prove that the optimization landscape is asymptotically benign and any stationary point recovers the true covariance. Finally, numerical experiments demonstrate that overparameterized GD can match or exceed the accuracy of state-of-the-art methods in challenging settings, while remaining simple and scalable.
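To make the parameterization concrete, the following is a minimal sketch of how a Hermitian Toeplitz matrix can be assembled from $K$ sinusoids. It is an illustration under stated assumptions, not the authors' implementation: the function name `caratheodory_covariance`, the normalized-frequency convention $R_{mn} = \sum_k a_k e^{2\pi j f_k (m-n)}$, and the use of real nonnegative amplitudes are all choices made here for the example, and no noise term or GD step is included.

```python
import cmath
import math

def caratheodory_covariance(amplitudes, frequencies, P):
    """Return the P x P matrix R[m][n] = sum_k a_k * exp(2j*pi*f_k*(m - n)).

    Assumes real amplitudes and frequencies in cycles per sample
    (conventions chosen for this sketch, not taken from the paper).
    """
    if len(amplitudes) != len(frequencies):
        raise ValueError("need one amplitude per frequency")
    # The first column r[d], for lags d = 0..P-1, determines the whole
    # Hermitian Toeplitz matrix.
    r = [sum(a * cmath.exp(2j * math.pi * f * d)
             for a, f in zip(amplitudes, frequencies))
         for d in range(P)]
    # Lower triangle uses r[m - n]; upper triangle uses its conjugate.
    return [[r[m - n] if m >= n else r[n - m].conjugate()
             for n in range(P)] for m in range(P)]

# Overparameterized example: K = 2P sinusoids on a uniform frequency grid.
P = 4
K = 2 * P
amplitudes = [1.0 / K] * K
frequencies = [k / K for k in range(K)]
R = caratheodory_covariance(amplitudes, frequencies, P)
```

With equal amplitudes on a uniform grid of $K = 2P$ frequencies, the nonzero lags cancel and `R` equals the identity up to rounding error. In a GD-based estimator, `amplitudes` and `frequencies` would be the learnable parameters.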
