Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization

Eliot Shekhtman, Yichen Zhou, Ingvar Ziemann, Nikolai Matni, Stephen Tu

TL;DR

The paper introduces a general Hellinger localization framework for deriving instance-optimal parameter recovery rates for maximum likelihood estimation from multi-trajectory data, without relying on trajectory mixing assumptions. The approach first controls the squared Hellinger distance at the path-measure level via a reduction to i.i.d. learning, then localizes this distance to a quadratic form in parameter space weighted by the trajectory Fisher information, so that the resulting bounds scale with the full data budget $mT$. The framework is instantiated in four case studies (mixtures of Markov chains, dependent regression with non-Gaussian noise, non-monotonic sinusoidal GLMs, and linear-attention sequence models), each achieving nearly instance-optimal rates that match asymptotic normality up to logarithmic factors and substantially improve upon standard reductions. The resulting bounds are sharper and trajectory-length-aware, with practical implications for training on many sequential data streams, as in large language models and attention-based sequence models. Overall, the work provides a broad, information-theoretic toolkit for nearly optimal parameter recovery in non-i.i.d. multi-trajectory settings and demonstrates its versatility across diverse sequential models.

Abstract

Learning from temporally correlated data is a core facet of modern machine learning. Yet our understanding of sequential learning remains incomplete, particularly in the multi-trajectory setting where data consists of many independent realizations of a time-indexed stochastic process. This important regime both reflects modern training pipelines, such as those for large foundation models, and offers the potential for learning without the typical mixing assumptions made in the single-trajectory case. However, instance-optimal bounds are known only for least-squares regression with dependent covariates; for more general models or loss functions, the only broadly applicable guarantees result from a reduction to either i.i.d. learning, with effective sample size scaling only in the number of trajectories, or an existing single-trajectory result when each individual trajectory mixes, with effective sample size scaling as the full data budget deflated by the mixing time. In this work, we significantly broaden the scope of instance-optimal rates in multi-trajectory settings via the Hellinger localization framework, a general approach for maximum likelihood estimation. Our method proceeds by first controlling the squared Hellinger distance at the path-measure level via a reduction to i.i.d. learning, followed by localization as a quadratic form in parameter space weighted by the trajectory Fisher information. This yields instance-optimal bounds that scale with the full data budget under a broad set of conditions. We instantiate our framework across four diverse case studies: a simple mixture of Markov chains, dependent linear regression under non-Gaussian noise, generalized linear models with non-monotonic activations, and linear-attention sequence models. In all cases, our bounds nearly match the instance-optimal rates from asymptotic normality, substantially improving over standard reductions.
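
The localization step described above can be illustrated numerically on a toy model. The sketch below is not the paper's construction: it assumes each trajectory consists of $T$ independent $N(\theta, \sigma^2)$ observations (so there is no temporal dependence), and it assumes the convention $H^2(p, q) = 1 - \int \sqrt{p\,q}$. Under those assumptions the path-measure squared Hellinger distance has a closed form, the trajectory Fisher information is $T/\sigma^2$, and for nearby parameters the distance is close to one eighth of the Fisher-weighted quadratic form. The function names are hypothetical.

```python
import math


def path_hellinger_sq(theta0, theta1, sigma, T):
    """Squared Hellinger distance between the laws of two toy trajectories.

    Assumption: a trajectory is T independent N(theta, sigma^2) draws, and
    H^2(p, q) = 1 - integral sqrt(p q). The Bhattacharyya coefficient of two
    equal-variance Gaussians is exp(-(theta0 - theta1)^2 / (8 sigma^2)), and
    it multiplies across the T independent coordinates.
    """
    delta = theta0 - theta1
    return 1.0 - math.exp(-T * delta ** 2 / (8.0 * sigma ** 2))


def trajectory_fisher_information(sigma, T):
    """Fisher information about theta carried by one toy trajectory."""
    return T / sigma ** 2


def localized_quadratic(theta0, theta1, sigma, T):
    """Local approximation: (1/8) * delta * I_traj * delta."""
    delta = theta0 - theta1
    return delta * trajectory_fisher_information(sigma, T) * delta / 8.0


if __name__ == "__main__":
    sigma, T = 1.0, 10
    for delta in (0.5, 0.1, 0.01):
        exact = path_hellinger_sq(0.0, delta, sigma, T)
        quad = localized_quadratic(0.0, delta, sigma, T)
        print(f"delta={delta}: exact={exact:.6g}, quadratic={quad:.6g}, "
              f"ratio={exact / quad:.6g}")
```

In this toy setting the ratio of the exact distance to the quadratic form approaches 1 as the parameters get closer, and the quadratic form grows linearly in $T$, which is the sense in which a bound on the path-level Hellinger distance translates into a trajectory-length-aware bound in parameter space.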

Paper Structure

This paper contains 83 sections, 33 theorems, 484 equations, 1 table.

Key Result

Theorem 3.1

We have with probability at least $1-\delta$, where $\mathcal{N}_\infty(\mathcal{P},\varepsilon)$ is the $\varepsilon$-covering number of $\mathcal{P}$ in the max divergence. Specifically, a set $\mathcal{P}' \subseteq \mathcal{P}$ is an $\varepsilon$-covering in max divergence if for every $p \in \mathcal{P}$ there exists a $p' \in \mathcal{P}'

Theorems & Definitions (74)

  • Theorem 3.1: cf. foster2024behaviorcloning
  • Definition 3.2: Hellinger cover
  • Proposition 3.2
  • Definition 3.3: Max FI cover
  • Definition 3.4
  • Theorem 3.5
  • Remark 3.6
  • Remark 3.7
  • Proof of \ref{thm:hellinger_bound_MLE}
  • Proposition 3.8
  • ...and 64 more