Convergence analysis of online algorithms for vector-valued kernel regression
Michael Griebel, Peter Oswald
TL;DR
This work develops a convergence theory for online, regularized vector-valued kernel regression in an RKHS framework. By formulating the problem through a feature-map-driven RKHS $H$ and a covariance-like operator $P_ρ$ on a smoothness scale $V_{P_ρ}^s$, it proves an order-optimal decay rate in expectation for the RKHS error: $\mathbb{E}(\|u-u^{(m)}\|_V^2)\le C^2(m+1)^{-s/(2+s)}$ for $0<s\le 1$ under mild, verifiable conditions. The analysis leverages a Schwarz-iteration-inspired update and elementary Hilbert-space techniques, and the rate reflects both the smoothness of the regression function and the noise level. A divergence result shows that key assumptions are necessary, and a special case provides explicit, near-optimal rates in a diagonal coefficient-learning setting. These results generalize prior scalar-valued analyses to the vector-valued case without strong spectral or probabilistic prerequisites, offering a principled foundation for online multitask and functional learning. The study also clarifies limitations regarding $L^2_ρ$ convergence and highlights practical parameter regimes (e.g., $t≈2/3$) that optimize decay.
Abstract
We consider the problem of approximating the regression function $f_μ:\, Ω\to Y$ from noisy $μ$-distributed vector-valued data $(ω_m,y_m)\in Ω\times Y$ by an online learning algorithm using a reproducing kernel Hilbert space $H$ (RKHS) as prior. In an online algorithm, i.i.d. samples become available one by one via a random process and are successively processed to build approximations to the regression function. Assuming that the regression function essentially belongs to $H$ (soft learning scenario), we provide estimates for the expected squared error in the RKHS norm of the approximations $f^{(m)}\in H$ obtained by a standard regularized online approximation algorithm. In particular, we show an order-optimal estimate $$ \mathbb{E}(\|ε^{(m)}\|_H^2)\le C (m+1)^{-s/(2+s)},\qquad m=1,2,\ldots, $$ where $ε^{(m)}$ denotes the error term after $m$ data points have been processed, the parameter $0<s\leq 1$ expresses an additional smoothness assumption on the regression function, and the constant $C$ depends on the variance of the input noise, the smoothness of the regression function, and other parameters of the algorithm. The proof, which is inspired by results on Schwarz iterative methods in the noiseless case, uses only elementary Hilbert space techniques and minimal assumptions on the noise, the feature map that defines $H$, and the associated covariance operator.
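To make the setting concrete, the following is a minimal sketch of a regularized online kernel regression loop for vector-valued outputs. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: it uses a generic shrink-then-correct update, a separable Gaussian kernel (a scalar kernel acting identically on every output component), and hypothetical step-size and regularization schedules. The names `OnlineKernelRegressor`, `gaussian_kernel`, and `run_demo`, the target function, and all numerical constants are invented for this example.

```python
import numpy as np


def gaussian_kernel(x, centers, width):
    """Scalar Gaussian kernel k(x, c) between one input x and each row of centers."""
    diff = centers - x
    return np.exp(-np.sum(diff * diff, axis=1) / (2.0 * width ** 2))


class OnlineKernelRegressor:
    """Stores f(x) = sum_i k(x, center_i) * coef_i with vector-valued coef_i.

    Assumption: a separable kernel, i.e. the scalar kernel times the identity on Y.
    """

    def __init__(self, in_dim, out_dim, width=0.2):
        self.centers = np.empty((0, in_dim))
        self.coefs = np.empty((0, out_dim))
        self.out_dim = out_dim
        self.width = width

    def predict(self, x):
        if len(self.centers) == 0:
            return np.zeros(self.out_dim)
        return gaussian_kernel(x, self.centers, self.width) @ self.coefs

    def update(self, x, y, step, reg):
        """Process one sample: f <- (1 - step*reg) * f + step * (y - f(x)) * k(x, .)."""
        residual = y - self.predict(x)      # computed before shrinking
        self.coefs *= 1.0 - step * reg      # regularization shrinks the old iterate
        self.centers = np.vstack([self.centers, x[None, :]])
        self.coefs = np.vstack([self.coefs, step * residual[None, :]])


def target(x):
    """Hypothetical smooth regression function from [0, 1] to R^2."""
    return np.array([np.sin(2 * np.pi * x[0]), np.cos(2 * np.pi * x[0])])


def run_demo(n_samples=500, seed=0, noise_std=0.1):
    rng = np.random.default_rng(seed)
    model = OnlineKernelRegressor(in_dim=1, out_dim=2)
    for m in range(n_samples):
        x = rng.uniform(0.0, 1.0, size=1)                    # i.i.d. input
        y = target(x) + noise_std * rng.standard_normal(2)   # noisy vector output
        step = 0.5 * (m + 1) ** -0.5     # hypothetical decaying step size
        reg = 0.05 * (m + 1) ** -0.25    # hypothetical decaying regularization
        model.update(x, y, step, reg)
    # Mean squared error on a grid. Note this is an L2-type error,
    # not the RKHS-norm error that the stated estimate controls.
    grid = np.linspace(0.0, 1.0, 50)
    err = np.mean([np.sum((model.predict(np.array([g])) - target(np.array([g]))) ** 2)
                   for g in grid])
    return model, err


if __name__ == "__main__":
    _, err = run_demo()
    print(f"grid mean squared error: {err:.4f}")
```

Each sample is seen exactly once and adds one kernel center, so the stored representation grows linearly with $m$. For orientation, the exponent $s/(2+s)$ in the stated estimate equals $1/3$ at $s=1$, so in that case the bound decays like $(m+1)^{-1/3}$.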
