On the Consistency of Kernel Methods with Dependent Observations
Pierre-François Massiani, Sebastian Trimpe, Friedrich Solowjow
TL;DR
This work addresses learning from dependent observations by introducing empirical weak convergence (EWC), an assumption that allows the asymptotic data distribution to be a random measure. It then develops a foundation for kernel methods under EWC, proving that kernel mean embeddings converge to the embedding of the random limit and that support vector machines (SVMs) with Hilbert-space-valued outputs remain consistent when the data-generating process satisfies EWC. The theory goes beyond i.i.d. and standard mixing assumptions to cover infinite-dimensional outputs (e.g., conditional kernel mean embeddings, CKMEs) and data generated by dynamical systems, by leveraging the convergence of random measures and the weak topology. The results show that kernel methods can be consistent under less restrictive conditions, while clarifying the trade-offs involving differentiability requirements and the lack of explicit convergence rates. Overall, the paper lays a rigorous foundation for learning with dependent data and extends statistical learning theory to Hilbert-space-valued outputs.
Abstract
The consistency of a learning method is usually established under the assumption that the observations are a realization of an independent and identically distributed (i.i.d.) or mixing process. Yet, kernel methods such as support vector machines (SVMs), Gaussian processes, or conditional kernel mean embeddings (CKMEs) all give excellent performance under sampling schemes that are obviously non-i.i.d., such as when data comes from a dynamical system. We propose the new notion of empirical weak convergence (EWC) as a general assumption explaining such phenomena for kernel methods. It assumes the existence of a random asymptotic data distribution and is a strict weakening of previous assumptions in the field. Our main results then establish consistency of SVMs, kernel mean embeddings, and general Hilbert-space-valued empirical expectations with EWC data. Our analysis holds for both finite- and infinite-dimensional outputs, as we extend classical results of statistical learning to the latter case. In particular, it is also applicable to CKMEs. Overall, our results open new classes of processes to statistical learning and can serve as a foundation for a theory of learning beyond i.i.d. and mixing.
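To make the idea of a random asymptotic data distribution concrete, the following Python sketch simulates a toy dependent process and tracks its empirical kernel mean embedding. Everything in it is an illustrative assumption rather than a construction from the paper: the process is an AR(1) chain whose mean is drawn once at random and then held fixed, so the long-run distribution of each trajectory is N(mu, s^2) with a random mu. The function names `simulate_regime_ar1` and `mmd_to_gaussian`, the Gaussian kernel, and all parameter values are hypothetical. The sketch does not verify the formal EWC definition; it only shows numerically that the empirical embedding of one trajectory approaches the embedding of that trajectory's own limit and not that of the other regime.

```python
import numpy as np


def simulate_regime_ar1(n, rng, a=0.5, s=1.0):
    """Toy dependent process: AR(1) around a mean drawn once at random.

    The regime mu is sampled a single time and then fixed, so the
    long-run distribution of the trajectory, N(mu, s^2), is itself random.
    """
    mu = float(rng.choice([-2.0, 2.0]))   # random regime, fixed along the path
    sigma = s * np.sqrt(1.0 - a**2)       # noise scale giving stationary std s
    x = np.empty(n)
    x[0] = mu + s * rng.standard_normal()
    for t in range(1, n):
        x[t] = mu + a * (x[t - 1] - mu) + sigma * rng.standard_normal()
    return x, mu


def mmd_to_gaussian(x, mu, s=1.0, ell=1.0):
    """RKHS distance between the empirical kernel mean embedding of x and
    the embedding of N(mu, s^2), for the Gaussian kernel
    k(u, v) = exp(-(u - v)^2 / (2 ell^2)), using closed-form Gaussian integrals.
    """
    diff = x[:, None] - x[None, :]
    term_xx = np.exp(-diff**2 / (2.0 * ell**2)).mean()
    scale = ell / np.sqrt(ell**2 + s**2)
    term_xp = (scale * np.exp(-(x - mu) ** 2 / (2.0 * (ell**2 + s**2)))).mean()
    term_pp = ell / np.sqrt(ell**2 + 2.0 * s**2)
    return float(np.sqrt(max(term_xx - 2.0 * term_xp + term_pp, 0.0)))


rng = np.random.default_rng(0)
x, mu = simulate_regime_ar1(2000, rng)

# Distance to the limit of this realization for growing sample sizes.
for n in (50, 500, 2000):
    print(n, mmd_to_gaussian(x[:n], mu))

err_right = mmd_to_gaussian(x, mu)    # limit selected by this trajectory
err_wrong = mmd_to_gaussian(x, -mu)   # limit of the other regime
print("own limit:", err_right, "other regime:", err_wrong)
```

The distance is computed in the RKHS norm because that is the sense in which the TL;DR describes convergence of kernel mean embeddings. Rerunning with a different seed may select the other regime, which is the point of the example: the limit depends on the realization.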
