Joker: Joint Optimization Framework for Lightweight Kernel Machines

Junhong Zhang, Zhihui Lai

TL;DR

Joker proposes a unified, memory-efficient framework for large-scale kernel machines. It formulates a dual optimization problem that accommodates a broad class of convex losses and solves it with a dual block coordinate descent method with trust region (DBCD-TR). To tackle memory bottlenecks, Joker approximates the kernel with random Fourier features, reducing per-iteration cost while maintaining competitive accuracy. The approach yields substantial memory savings (up to about 90% in the reported experiments) with training times comparable to state-of-the-art baselines across KRR, SVM, and kernel logistic regression (KLR) tasks on large-scale datasets. This makes lightweight kernel methods practical on commodity hardware without sacrificing model diversity or performance.
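The random Fourier feature idea mentioned above can be illustrated with a minimal NumPy sketch. It assumes a Gaussian kernel with bandwidth `sigma`; the function names, the feature count `D`, and the synthetic data are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel: K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def rff_map(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/D) * cos(W^T x + b)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, D, sigma = 50, 5, 10_000, 2.0
X = rng.standard_normal((n, d))

# Frequencies drawn from N(0, sigma^-2 I), phases from U[0, 2*pi].
W = rng.standard_normal((d, D)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Z = rff_map(X, W, b)                 # n x D feature matrix
K_exact = gaussian_kernel(X, X, sigma)
K_approx = Z @ Z.T                   # inner products approximate the kernel
max_err = np.abs(K_exact - K_approx).max()
print(f"max abs kernel approximation error: {max_err:.4f}")
```

The point of the sketch is that kernel entries can be regenerated from `Z` (or from `W` and `b`) on demand, so the full n x n kernel matrix never has to be stored.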

Abstract

Kernel methods are powerful tools for nonlinear learning with well-established theory, but scalability has been a long-standing challenge. Despite existing successes, large-scale kernel methods have two limitations: (i) the memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models remain understudied. In this paper, we propose Joker, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that Joker saves up to 90% of memory while achieving training time and performance comparable to, or even better than, state-of-the-art methods.
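To make the block coordinate descent idea concrete, here is a minimal sketch for the KRR case only. It is not the paper's DBCD-TR: it assumes the dual objective 0.5 * alpha^T (K + lam*I) alpha - y^T alpha, for which each block subproblem is quadratic and can be solved exactly, so no trust region is needed. The function name, the scaling of `lam`, and the synthetic data are assumptions.

```python
import numpy as np

def dual_bcd_krr(K, y, lam, block_size, epochs):
    """Cyclic block coordinate descent on the assumed KRR dual
    D(alpha) = 0.5 * alpha^T (K + lam*I) alpha - y^T alpha."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(epochs):
        for start in range(0, n, block_size):
            B = np.arange(start, min(start + block_size, n))
            # Gradient restricted to block B; only the rows K[B] are touched.
            grad_B = K[B] @ alpha + lam * alpha[B] - y[B]
            # Block Hessian; an exact Newton step minimizes the quadratic in alpha[B].
            H_BB = K[np.ix_(B, B)] + lam * np.eye(len(B))
            alpha[B] -= np.linalg.solve(H_BB, grad_B)
    return alpha

# Synthetic problem (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, sigma, lam = 60, 5, 2.0, 1.0
X = rng.standard_normal((n, d))
sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * X @ X.T
K = np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
y = rng.standard_normal(n)

alpha = dual_bcd_krr(K, y, lam, block_size=10, epochs=200)
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)
print("max deviation from direct solve:", np.abs(alpha - alpha_direct).max())
```

Each update needs only `block_size` rows of the kernel matrix, which is what makes block methods attractive when the full matrix does not fit in memory. For non-quadratic losses such as the logistic loss, the block subproblem has no closed form, which is presumably where a trust-region step becomes useful.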

Paper Structure

This paper contains 24 sections, 2 theorems, 28 equations, 3 figures, 10 tables.

Key Result

Theorem 1

Let $\xi_{y}(\cdot):{\mathbb R}\mapsto{\mathbb R}_+$ be defined as $\xi_y(u):=\ell(y,u)$. Then the optimal solution of the base model (eq:base-model) is given by …, where $\Omega=\{{\bm\alpha}:-\lambda\alpha_i\in\mathsf{dom}~{\xi_{y_i}^*},i\in[n]\}$ is the feasible region, $\xi^*_{y}(\cdot)$ is the Fenchel conjugate of $\xi_{y}(\cdot)$, and ${\bm K}$ is the kernel matrix with $K_{ij}=\langle{\bm\varphi}({\bm x}_i),{\bm
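
As a worked illustration of how the feasible region $\Omega$ follows from the domain of the conjugate, consider two standard losses. The exact scaling of each loss and the assumption $\lambda>0$ are choices made here for illustration and may differ from the paper's conventions.

$$
\xi_y(u)=\tfrac12(u-y)^2
\;\Longrightarrow\;
\xi_y^*(v)=\sup_{u}\big\{uv-\tfrac12(u-y)^2\big\}=\tfrac12 v^2+yv,
\qquad \mathsf{dom}~\xi_y^*={\mathbb R},
$$

so for the squared loss $\Omega={\mathbb R}^n$ and the dual problem is unconstrained. For the hinge loss with $y\in\{-1,+1\}$,

$$
\xi_y(u)=\max(0,\,1-yu)
\;\Longrightarrow\;
\xi_y^*(v)=
\begin{cases}
yv, & -1\le yv\le 0,\\
+\infty, & \text{otherwise},
\end{cases}
$$

so the condition $-\lambda\alpha_i\in\mathsf{dom}~\xi_{y_i}^*$ becomes the box constraint $0\le y_i\alpha_i\le 1/\lambda$, which is the familiar constraint set of the SVM dual.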

Figures (3)

  • Figure 1: Performance versus the model size on HIGGS.
  • Figure 2: Test performance versus time.
  • Figure 3: The primal, dual objectives, and validation loss of Joker versus the iteration steps.

Theorems & Definitions (4)

  • Theorem 1
  • Proposition 2
  • proof
  • proof