Learning with little mixing

Ingvar Ziemann, Stephen Tu

TL;DR

This paper studies regression on dependent time-series data under square loss. It shows that the least-squares estimator (LSE) attains iid-like excess risk rates after a finite burn-in, provided a trajectory hypercontractivity condition holds and the covariate process is mildly ergodic. The authors develop a one-sided, martingale-based analysis that yields fast rates without mixing-time deflation, and they handle unbounded trajectories via truncation-based coupling. They instantiate the theory in parametric settings, including linear dynamical systems and generalized linear models (GLMs), where it gives nearly minimax optimal rates after a polynomial burn-in, and they illustrate learning with little mixing in numerical experiments. The framework covers both finite- and infinite-dimensional function classes (e.g., ellipsoids in $\ell^2(\mathbb{N})$) and provides tools for system identification with dependent covariates.

Abstract

We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\varepsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
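
For orientation, the setting described in the abstract can be written out roughly as follows. This is a paraphrase under assumed notation (the symbols $f_\star$, $W_t$, $\mathscr{F}$, and $\widehat{f}$ are labels chosen here), and the second display is only our reading of trajectory hypercontractivity, so it should be checked against the exact statement of Definition 4.1 in the paper.

$$
Y_t = f_\star(X_t) + W_t, \qquad \widehat{f} \in \operatorname*{arg\,min}_{f \in \mathscr{F}} \frac{1}{T} \sum_{t=0}^{T-1} \lVert Y_t - f(X_t) \rVert_2^2,
$$

where $f_\star \in \mathscr{F}$ (realizability) and $(W_t)$ is a martingale difference sequence. Trajectory $(C,\alpha)$-hypercontractivity with $\alpha \in [1,2]$ then asks that fourth moments along the trajectory be controlled by second moments, for every $f$ in the class under consideration:

$$
\mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} \lVert f(X_t) \rVert_2^4 \right] \le C \left( \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} \lVert f(X_t) \rVert_2^2 \right] \right)^{\alpha}.
$$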

Paper Structure (50 sections, 32 theorems, 194 equations, 2 figures)

This paper contains 50 sections, 32 theorems, 194 equations, 2 figures.

Key Result

Theorem 4.1

Fix $B > 0$, $C : (0, B] \to \mathbb{R}_+$, $\alpha \in [1,2]$, and $r \in (0, B]$. Suppose that $\mathscr{F}_\star$ is star-shaped and $B$-bounded. Let $\mathscr{F}_r \subset \mathscr{F}_\star$ be a $r/\sqrt{8}$-net of $\partial B(r)$ in the supremum norm $\lVert \cdot \rVert_\infty$, and suppose t

Figures (2)

  • Figure 1: $L^2$ excess risk as a function of dataset length $T$ of the empirical risk minimizer on the single trajectory (Trajectory) dataset versus the independent baseline (Ind Baseline) dataset.
  • Figure 2: Ratio of the $L^2$ excess risk as a function of dataset length $T$ of the empirical risk minimizer (ERM) on the single trajectory dataset over the ERM on the independent baseline dataset. The dashed green curve (Ideal) marks a ratio of exactly one.
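
The comparison behind these two figures can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' experimental code: it assumes linear dynamics $X_{t+1} = A X_t + W_t$ with Gaussian noise and $X_0 = 0$ (the paper's experiments may use a different model), and the function names `simulate_trajectory`, `simulate_independent_pairs`, `average_covariance`, and `compare` are hypothetical. The independent baseline here draws the pair at time $t$ from its own separate rollout, so the marginals match the trajectory while the pairs are independent.

```python
import numpy as np


def simulate_trajectory(A, T, rng, noise_std=1.0):
    """Roll out X_{t+1} = A X_t + W_t from X_0 = 0; return (inputs, targets)."""
    d = A.shape[0]
    X = np.zeros((T + 1, d))
    for t in range(T):
        X[t + 1] = A @ X[t] + noise_std * rng.standard_normal(d)
    return X[:-1], X[1:]


def simulate_independent_pairs(A, T, rng, noise_std=1.0):
    """Pair t comes from its own independent rollout, observed at time t."""
    d = A.shape[0]
    Z = np.zeros((T, d))  # row i holds the current state of rollout i
    X_in, Y_out = np.zeros((T, d)), np.zeros((T, d))
    for t in range(T):
        Z_next = Z @ A.T + noise_std * rng.standard_normal((T, d))
        X_in[t], Y_out[t] = Z[t], Z_next[t]
        Z = Z_next
    return X_in, Y_out


def least_squares(X_in, Y_out):
    """Least-squares estimate of A in Y_t ~ A X_t."""
    # lstsq solves X_in @ B ~ Y_out, where B = A^T.
    B, *_ = np.linalg.lstsq(X_in, Y_out, rcond=None)
    return B.T


def average_covariance(A, T, noise_std=1.0):
    """(1/T) * sum_{t=0}^{T-1} Cov(X_t) for the rollout started at X_0 = 0."""
    d = A.shape[0]
    S, total = np.zeros((d, d)), np.zeros((d, d))
    for _ in range(T):
        total += S
        S = A @ S @ A.T + noise_std**2 * np.eye(d)
    return total / T


def excess_risk(A_hat, A, Sigma_bar):
    """Trajectory-averaged L^2 error of x -> A_hat x against x -> A x."""
    D = A_hat - A
    return float(np.trace(D @ Sigma_bar @ D.T))


def compare(A, T, n_trials=20, seed=0):
    """Mean excess risk of least squares on each dataset type."""
    rng = np.random.default_rng(seed)
    Sigma_bar = average_covariance(A, T)
    samplers = {"trajectory": simulate_trajectory,
                "independent": simulate_independent_pairs}
    risks = {name: [] for name in samplers}
    for _ in range(n_trials):
        for name, sampler in samplers.items():
            X_in, Y_out = sampler(A, T, rng)
            A_hat = least_squares(X_in, Y_out)
            risks[name].append(excess_risk(A_hat, A, Sigma_bar))
    return {name: float(np.mean(v)) for name, v in risks.items()}


if __name__ == "__main__":
    A = np.diag([0.9, 0.5, 0.1])  # an arbitrary stable system, chosen for illustration
    for T in (50, 200, 800):
        r = compare(A, T)
        print(T, r["trajectory"] / r["independent"])
```

The printed ratio plays the role of the curve in Figure 2: values near one would indicate that the single-trajectory estimator is about as accurate as the independent baseline at that dataset length.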

Theorems & Definitions (56)

  • Definition 4.1: Trajectory $(C,\alpha)$-hypercontractivity
  • Definition 4.2: Dependency matrix (Samson, 2000)
  • Definition 4.3: Martingale offset complexity (cf. Liang et al., 2015; Ziemann et al., 2022)
  • Theorem 4.1
  • Corollary 4.1
  • Corollary 4.2
  • Theorem 5.1 (Samson, 2000)
  • Proposition 5.1
  • Theorem 5.2
  • Proposition 6.1
  • ...and 46 more