Gradient Descent Learns Linear Dynamical Systems

Moritz Hardt, Tengyu Ma, Benjamin Recht

TL;DR

The paper shows that stochastic gradient descent, when combined with a projection onto a convex acquiescence region, efficiently learns unknown linear dynamical systems from noisy sequence data despite non-convexity. By translating the problem into a frequency-domain idealized risk that is weakly quasi-convex, the authors obtain polynomial-time convergence and sample complexity, and they extend the framework to improper learning and MIMO settings. A key idea is the acquiescence condition on the denominator of the transfer function, which controls the optimization landscape; over-parameterization further broadens applicability. The work connects system identification, passive systems, and modern optimization to yield practical, theoretically grounded guarantees, supported by simulations. The results offer a principled path toward tractable learning for a broad class of linear and MIMO dynamical systems in noisy environments.

Abstract

We prove that stochastic gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.

Paper Structure

This paper contains 40 sections, 38 theorems, 132 equations, 2 figures, 3 algorithms.

Key Result

Theorem 1.1

Under our assumption, projected stochastic gradient descent, when given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ with population risk

Figures (2)

  • Figure 1: An example of a polynomial $q$ that satisfies our assumption. The unit circle is the collection of inputs of $q$, and the other curve shows the corresponding outputs (in matching colors). The image of the polynomial stays in the wedge of complex values satisfying $\Re(q(z))>|\Im(q(z))|.$
  • Figure 2: The performance of projected stochastic gradient descent with over-parameterization, vanilla SGD, and SGD with gradient clipping on three different instances of dynamical systems with true state dimension 20. The solid lines are from our proposed projected SGD with (over-parameterized) state dimension 20, 25, 30, and 35. The dotted line corresponds to SGD with gradients clipped to Frobenius norm 1. The dashed lines correspond to vanilla SGD, and a triangle marker means the error blows up to infinity. The plot demonstrates the effect of over-parameterization on our algorithm. Note that the losses are on different scales because the true systems in these three instances have different norms of impulse responses (which equal the loss of the all-zero fit).
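
The wedge condition in the Figure 1 caption can be checked numerically. The following is a minimal sketch, not code from the paper: it samples the unit circle and tests $\Re(q(z))>|\Im(q(z))|$ at each sample. The function name `in_wedge_on_unit_circle`, the coefficient convention $q(z) = c_0 + c_1 z + \dots + c_n z^n$, and the two example polynomials are all assumptions made for illustration. A finite sample can only suggest, not prove, that the condition holds everywhere on the circle.

```python
import numpy as np


def in_wedge_on_unit_circle(coeffs, num_points=2048):
    """Sampled check of Re(q(z)) > |Im(q(z))| for |z| = 1.

    coeffs: [c_0, c_1, ..., c_n] for q(z) = c_0 + c_1 z + ... + c_n z^n
    (hypothetical convention chosen for this sketch).
    """
    theta = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    z = np.exp(1j * theta)  # sample points on the unit circle
    # np.polyval expects the highest-degree coefficient first, so reverse.
    values = np.polyval(np.asarray(coeffs)[::-1], z)
    return bool(np.all(values.real > np.abs(values.imag)))


# Illustrative polynomials (not taken from the paper).
q_ok = [1.0, 0.3, 0.1]   # small perturbation of the constant 1
q_bad = [1.0, 1.5, 0.0]  # q(-1) = -0.5 has negative real part

print(in_wedge_on_unit_circle(q_ok))
print(in_wedge_on_unit_circle(q_bad))
```

For `q_ok`, the real part is at least $1 - 0.4 = 0.6$ while the imaginary part is at most $0.4$ in magnitude, so every sample lies inside the wedge; `q_bad` leaves the wedge at $z = -1$.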

Theorems & Definitions (47)

  • Theorem 1.1: Informal
  • Theorem 1.2: Informal
  • Definition 2.1: Weak quasi-convexity
  • Definition 2.2
  • Proposition 2.3
  • Remark 2.4
  • Proposition 2.5
  • Definition 3.1: Idealized risk
  • Proposition 3.2
  • Lemma 3.3
  • ...and 37 more