On the Nyström Approximation for Preconditioning in Kernel Machines
Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin
TL;DR
This work addresses the poor conditioning that slows scalable kernel methods by analyzing Nyström-based approximations to spectral preconditioners for gradient-based optimization. It proves that a Nyström sample size of $s = \Omega\left(\frac{\log^4 n}{\varepsilon^4}\right)$, where $n$ is the dataset size, suffices to ensure $\kappa(\mathcal{P}_{s,q}^{1/2}\mathcal{K}\mathcal{P}_{s,q}^{1/2}) \le (1+\varepsilon)^4 \kappa(\mathcal{P}_q\mathcal{K})$ with high probability. Here $\kappa$ denotes the condition number, $\mathcal{P}_q$ the exact preconditioner, and $\mathcal{P}_{s,q}$ its Nyström approximation. In other words, the approximate preconditioner achieves nearly the same speed-up as the exact one at substantially lower storage and setup cost. The bound gives concrete guidance for designing scalable preconditioned kernel methods (e.g., EigenPro variants) on large datasets, trading off sample size, accuracy, and computational overhead.
Abstract
Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can be expensive, which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nyström approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nyström-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
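To make the comparison concrete, the following is a minimal numerical sketch, not the construction analyzed in the paper. It works with a finite kernel matrix rather than an operator and assumes a Laplacian kernel, synthetic Gaussian data, and an EigenPro-style preconditioner that rescales the top $q$ eigendirections down to the level of the $(q+1)$-th eigenvalue. The Nyström variant estimates those eigendirections from $s$ sampled points and orthonormalizes them with a QR step, which is a simplification introduced here. All function names and parameter values (`n`, `d`, `q`, `s`, `bandwidth`) are illustrative assumptions.

```python
import numpy as np

def laplacian_kernel(A, B, bandwidth=2.0):
    # Pairwise Euclidean distances, then exp(-dist / bandwidth).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

def sqrt_preconditioner(Q, ratios):
    # For orthonormal Q: P = I - Q diag(1 - ratios) Q^T, hence
    # P^{1/2} = I - Q diag(1 - sqrt(ratios)) Q^T.
    n = Q.shape[0]
    return np.eye(n) - (Q * (1.0 - np.sqrt(ratios))) @ Q.T

def condition_number(M):
    # M is symmetric positive definite here; symmetrize against round-off.
    eigs = np.linalg.eigvalsh((M + M.T) / 2)
    return eigs[-1] / eigs[0]

rng = np.random.default_rng(0)
n, d, q, s = 300, 3, 5, 60          # illustrative sizes (assumptions)
X = rng.standard_normal((n, d))
K = laplacian_kernel(X, X)
kappa_plain = condition_number(K)

# Exact spectral preconditioner: needs the top eigenpairs of the full matrix.
lam, E = np.linalg.eigh(K)
lam, E = lam[::-1], E[:, ::-1]      # sort eigenpairs in descending order
R_exact = sqrt_preconditioner(E[:, :q], lam[q] / lam[:q])
kappa_exact = condition_number(R_exact @ K @ R_exact)

# Nystrom variant: eigendecompose only the s x s sampled block.
idx = rng.choice(n, size=s, replace=False)
sig, U = np.linalg.eigh(K[np.ix_(idx, idx)])
sig, U = sig[::-1], U[:, ::-1]
# Extend the sampled eigenvectors to all n points, then orthonormalize.
E_nys, _ = np.linalg.qr(K[:, idx] @ U[:, :q])
# The n/s rescaling of Nystrom eigenvalue estimates cancels in these ratios.
R_nys = sqrt_preconditioner(E_nys, sig[q] / sig[:q])
kappa_nys = condition_number(R_nys @ K @ R_nys)

print(f"no preconditioner:       {kappa_plain:.1f}")
print(f"exact preconditioner:    {kappa_exact:.1f}")
print(f"Nystrom (s={s}) version:  {kappa_nys:.1f}")
```

In this sketch the exact preconditioner requires an eigendecomposition of the full $n \times n$ matrix, while the Nyström variant only decomposes an $s \times s$ block and stores $q$ vectors of length $n$. Varying `s` gives a rough feel for how the sample size affects the gap between the two preconditioned condition numbers.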
