Fast training of large kernel models with delayed projections

Amirhesam Abedsoltan, Siyuan Ma, Parthe Pandit, Mikhail Belkin

TL;DR

This paper presents a methodology for building kernel machines that scale efficiently with both data size and model size. It introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible.

Abstract

Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes, even though such scaling is a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible and pushing the practical limits of kernel-based learning. We validate our algorithm, EigenPro4, across multiple datasets, demonstrating a drastic training speedup over existing methods while maintaining comparable or better classification accuracy.

Paper Structure

This paper contains 37 sections, 2 theorems, 42 equations, 6 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Consider any dataset $X, \bm{y}$ and a choice of model centers $Z$, with a kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Assume that $K(X,X)$ and $K(Z,X)$ are full rank. Then, Algorithm alg:eigenpro4_exact converges to the following solution: Furthermore, if $\bm{y} = K(X,Z) \bm{\beta}^* + \bm{\xi}$, where $\bm{\xi}$ is a vector of independent centered random noise with $\

Figures (6)

  • Figure 1: Per-epoch time comparison between different solvers. Classification test accuracy (in percent) is annotated next to each data point, showing that EigenPro4 (EP4) maintains superior or comparable performance across all model sizes. Details of the experiment can be found in the experiments appendix.
  • Figure 2: Design of EigenPro4. An illustration of how batches of data are processed by the two algorithms. EigenPro3 involves an expensive projection step when processing every batch of data. EigenPro4 waits for multiple batches to be processed before running the projection step for all of them together. This reduces the amortized cost for processing each batch.
  • Figure 3: Performance and computational time comparison between EigenPro 4.0 ($T=11$) and EigenPro 3.0 (equivalent to $T=1$), highlighting the impact of the projection step on the performance of EigenPro 4.0. Details of the experiment can be found in the experiments appendix.
  • Figure 4: Overview of iteration scheme for EigenPro 4. The figure illustrates how the model updates are performed over multiple iterations in EigenPro4. Weights are updated using batches, and gradients are accumulated iteratively until a projection step is executed. Temporary centers and weights are maintained during auxiliary iterations, which are merged through projection after a certain number of batches. This approach reduces the computational cost by accumulating gradients before performing the projection, leading to more efficient batch processing.
  • Figure 5: Multi-epoch performance and convergence comparison for EigenPro 3 and EigenPro 4. Details of the experiment can be found in the experiments appendix.
  • ...and 1 more figure
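
The batching scheme described in the captions of Figures 2 and 4 can be illustrated with a small sketch. This is not the authors' EigenPro4 implementation: it omits the preconditioner (plain SGD instead of PSGD), uses a Gaussian kernel with a fixed bandwidth, and performs the projection with a direct regularized linear solve. The function names (`gaussian_kernel`, `project_onto_centers`, `train`) and all parameter values are assumptions chosen for illustration. Setting `T=1` projects after every batch, as the captions describe for EigenPro3, while `T>1` keeps temporary centers and weights for `T` batches before a single projection.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix K(A, B) for points stored as rows of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def project_onto_centers(Z, K_zz, tmp_X, tmp_w):
    """Project sum_j w_j K(., x_j) onto span{K(., z_i)} in the RKHS.

    Illustrative assumption: a direct solve of K(Z,Z) a = K(Z,X_tmp) w,
    with K_zz already containing a small ridge term for stability.
    """
    rhs = gaussian_kernel(Z, tmp_X) @ tmp_w
    return np.linalg.solve(K_zz, rhs)


def train(X, y, Z, T=4, batch_size=32, lr=1.0, epochs=5, ridge=1e-8, seed=0):
    """Plain kernel SGD with a projection every T batches (no preconditioning)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    alpha = np.zeros(Z.shape[0])                      # weights on the model centers Z
    K_zz = gaussian_kernel(Z, Z) + ridge * np.eye(Z.shape[0])
    tmp_X, tmp_w = np.empty((0, d)), np.empty(0)      # temporary centers and weights
    pending, n_projections = 0, 0

    for _ in range(epochs):
        order = rng.permutation(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Prediction uses both the model centers and the temporary centers.
            pred = gaussian_kernel(Xb, Z) @ alpha + gaussian_kernel(Xb, tmp_X) @ tmp_w
            # The functional gradient step adds the batch points as temporary centers.
            tmp_X = np.vstack([tmp_X, Xb])
            tmp_w = np.concatenate([tmp_w, -lr / len(idx) * (pred - yb)])
            pending += 1
            if pending == T:
                # Delayed projection: merge T batches of updates in one solve.
                alpha += project_onto_centers(Z, K_zz, tmp_X, tmp_w)
                tmp_X, tmp_w = np.empty((0, d)), np.empty(0)
                pending = 0
                n_projections += 1

    if pending:  # merge any leftover temporary centers at the end
        alpha += project_onto_centers(Z, K_zz, tmp_X, tmp_w)
        n_projections += 1
    return alpha, n_projections
```

In this sketch the number of projection solves drops by a factor of `T`, at the price of evaluating the kernel against a growing set of temporary centers between projections. That is the amortization trade-off the figure captions describe; the actual speedups reported in the paper depend on details not reproduced here.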

Theorems & Definitions (4)

  • Definition 1: Top-$q$ Eigensystem
  • Proposition 1
  • Proposition 2
  • proof