
Efficient Convex Algorithms for Universal Kernel Learning

Aleksandr Talitckii, Brendon K. Colbert, Matthew M. Peet

TL;DR

This paper poses the problem of learning semiseparable kernels as a minimax optimization problem and proposes an SVD-QCQP primal-dual algorithm that dramatically reduces computational complexity compared with previous SDP-based approaches.

Abstract

The accuracy and complexity of machine learning algorithms based on kernel optimization are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); be universal (for accuracy). Recently, a framework was proposed for using positive matrices to parameterize a class of positive semiseparable kernels. Although this class can be shown to meet all three criteria, previous algorithms for optimization of such kernels were limited to classification and furthermore relied on computationally complex Semidefinite Programming (SDP) algorithms. In this paper, we pose the problem of learning semiseparable kernels as a minimax optimization problem and propose an SVD-QCQP primal-dual algorithm that dramatically reduces computational complexity compared with previous SDP-based approaches. Furthermore, we provide an efficient implementation of this algorithm for both classification and regression, which enables us to solve problems with 100 features and up to 30,000 data points. Finally, when applied to benchmark data, the algorithm demonstrates the potential for significant improvement in accuracy over typical (but non-convex) approaches such as Neural Nets and Random Forest, with similar or better computation time.

Paper Structure (36 sections, 15 theorems, 80 equations, 5 figures, 4 tables, 6 algorithms)

Key Result

Lemma 5

Let $N$ be any bounded measurable function $N: Y \times X \rightarrow \mathbb{R}^{n_P}$ on compact $X$ and $Y$. If we define $$\mathcal{K} := \left\{ k \;:\; k(x,y) = \int_Y N(z,x)^T P N(z,y)\, dz, \quad P \succeq 0 \right\},$$ then any $k\in \mathcal{K}$ is a positive kernel function and $\mathcal{K}$ is tractable.
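
As a numerical illustration of the lemma, the sketch below uses the integral form $k(x,y) = \int_Y N(z,x)^T P N(z,y)\,dz$ with $P \succeq 0$, taken here as an assumption about the class $\mathcal{K}$. The map `features`, the choice $Y = [0,1]$, and the trapezoidal quadrature are hypothetical choices made for illustration and are not taken from the paper. It checks two properties: the Gram matrix is positive semidefinite, and the kernel depends linearly on $P$.

```python
import numpy as np

def features(z, x):
    """Hypothetical bounded map N(z, x) in R^3, evaluated on a grid.

    Returns an array of shape (len(z), len(x), 3).
    """
    Z, X = np.meshgrid(z, x, indexing="ij")
    return np.stack([np.ones_like(Z), X, np.cos(Z * X)], axis=-1)

def gram(P, x, num_nodes=201):
    """Approximate K[i, j] = int_0^1 N(z, x_i)^T P N(z, x_j) dz (trapezoidal rule)."""
    z = np.linspace(0.0, 1.0, num_nodes)
    w = np.full(num_nodes, 1.0 / (num_nodes - 1))  # trapezoidal weights
    w[0] *= 0.5
    w[-1] *= 0.5
    F = features(z, x)
    # Sum over quadrature nodes q and feature indices a, b.
    return np.einsum("q,qia,ab,qjb->ij", w, F, P, F)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=8)   # sample points in X = [-1, 1]
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
P1, P2 = A @ A.T, B @ B.T            # positive semidefinite by construction

K1 = gram(P1, x)
min_eig = np.linalg.eigvalsh(K1).min()                             # >= 0 up to rounding
linear_gap = np.abs(gram(P1 + P2, x) - (K1 + gram(P2, x))).max()   # ~ 0 by linearity
print(min_eig, linear_gap)
```

Because the quadrature weights are positive, the discretized Gram matrix is a positive combination of terms $F_q P F_q^T$, so positive semidefiniteness holds for the approximation as well as for the exact integral.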

Figures (5)

  • Figure 1: Convergence rates of the Frank-Wolfe algorithm \ref{FWTKL} and the alternative APD algorithm described in Subsection \ref{sec:APD}. In (a) we plot the gap between $OPT_A(P_k)$ and $OPT_P(\alpha_k)$ vs. iteration number for the Frank-Wolfe Algorithm \ref{FWTKL}; in (b) we plot the same gap vs. iteration number for the APD algorithm; and in (c) we plot the boosted algorithm. Both demonstrate sublinear convergence, but with enhanced performance for the hybrid algorithm.
  • Figure 2: Per-iteration complexity of the proposed FW algorithm \ref{FWTKL} and the alternative APD algorithm described in Subsection \ref{sec:APD}. Panels (a) and (c) show log-log plots of the iteration complexity of the Frank-Wolfe (FW) TKL classification and regression algorithms, respectively, as a function of $m$ for several values of $n_P$. Here $m$ is the number of samples and $n_P^{2}$ is the number of parameters in $\mathcal{K}$, so that $P \in \mathbb{S}^{n_P}$. Panels (b) and (d) show log-log plots of the iteration complexity of the Accelerated Primal Dual (APD) algorithm for classification and regression, respectively, as a function of $m$ for several values of $n_P$. In both cases, the best linear fit is included for reference.
  • Figure 3: Subfigure (a) shows a 3D representation of the section of the Grand Canyon to be fitted. In (b) we plot elevation data of this section of the Grand Canyon. In (c) we plot the predictor for a hand-tuned Gaussian kernel. In (d) we plot the predictor from Algorithm \ref{FWTKL} for $d=2$.
  • Figure 4: The number of iterations of the SVM subproblem required to achieve the desired tolerance $\varepsilon=0.1$ as a function of the rank of $P$. The SVM subproblem was solved using the LIBSVM implementation. The red dots and error bars represent the average number of iterations of the SVM algorithm and the 95$\%$ confidence interval over 20 trials for (a) the regression problem on the California Housing (CA) data set in \cite{pace1997sparse} and (b) the classification problem on the Shill Bid data set in \cite{alzahrani2018scraping,alzahrani2020clustering}. The blue line indicates the best linear fit of the average number of iterations.
  • Figure : The Frank-Wolfe Algorithm for Matrices.
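
The captions above refer to a Frank-Wolfe algorithm for matrices, which is not reproduced on this page. As a rough sketch of the general technique only, and not the paper's algorithm, the code below runs textbook Frank-Wolfe over the set $\{P \succeq 0,\ \mathrm{trace}(P) = 1\}$ with the standard step size $2/(k+2)$. The feasible set, the toy quadratic objective, and the function name `frank_wolfe_spectrahedron` are all assumptions chosen for illustration.

```python
import numpy as np

def frank_wolfe_spectrahedron(grad, n, num_iters=500):
    """Generic Frank-Wolfe over {P >= 0, trace(P) = 1} with step size 2 / (k + 2)."""
    P = np.eye(n) / n                      # feasible starting point
    gaps = []
    for k in range(num_iters):
        G = grad(P)                        # assumed symmetric
        # Linear minimization oracle: <S, G> is minimized over the set by S = v v^T,
        # where v is a unit eigenvector for the smallest eigenvalue of G.
        _, vecs = np.linalg.eigh(G)        # eigenvalues returned in ascending order
        v = vecs[:, 0]
        S = np.outer(v, v)
        gaps.append(float(np.sum((P - S) * G)))   # Frank-Wolfe duality gap
        P = P + (2.0 / (k + 2.0)) * (S - P)       # convex combination stays feasible
    return P, gaps

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
T = M @ M.T
T /= np.trace(T)                           # feasible target, so the optimal value is 0

# Toy objective f(P) = 0.5 * ||P - T||_F^2 with gradient P - T (illustrative only).
P_fw, gaps = frank_wolfe_spectrahedron(lambda P: P - T, n=4)
objective = 0.5 * np.linalg.norm(P_fw - T) ** 2
print(objective, min(gaps))
```

Each iterate is a convex combination of trace-one positive semidefinite matrices, so feasibility is maintained without any projection, and the recorded gap gives a computable upper bound on suboptimality for convex objectives.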

Theorems & Definitions (20)

  • Definition 1
  • Definition 2: \cite{scholkopf2001generalized}
  • Definition 3
  • Definition 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Proposition 8: Danskin's Theorem
  • Lemma 9
  • Theorem 10
  • ...and 10 more