The Nyström method for convex loss functions

Andrea Della Vecchia, Ernesto De Vito, Jaouad Mourtada, Lorenzo Rosasco

TL;DR

This paper investigates an extension of classical empirical risk minimization in which the hypothesis space is a random subspace of a given Hilbert space, and examines the Nyström method, where the subspace is defined by a random subset of the data.

Abstract

We investigate an extension of classical empirical risk minimization, where the hypothesis space consists of a random subspace within a given Hilbert space. Specifically, we examine the Nyström method, where the subspaces are defined by a random subset of the data. This approach recovers Nyström approximations used in kernel methods as a specific case. Using random subspaces naturally leads to computational advantages, but a key question is whether it compromises learning accuracy. Recently, the tradeoffs between statistics and computation have been explored for the square loss and for self-concordant losses, such as the logistic loss. In this paper, we extend these analyses to general convex Lipschitz losses, which may lack smoothness, such as the hinge loss used in support vector machines. Our main results show that there are various scenarios in which computational gains can be achieved without sacrificing learning performance. When specialized to smooth loss functions, our analysis recovers most previous results. Moreover, it makes it possible to consider classification problems and to translate the surrogate risk bounds into classification error bounds. This in turn allows us to compare the effect of Nyström approximations when combined with different loss functions, such as the hinge or the square loss.
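To make the setting concrete, the following is a minimal sketch of regularized empirical risk minimization with the hinge loss over a Nyström subspace, written in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, uniform subsampling of the $m$ Nyström points, the plain subgradient solver with step size $1/\sqrt{t}$, and all function and parameter names (`gaussian_kernel`, `nystrom_hinge_fit`, `nystrom_predict`, `sigma`, `steps`) are choices made here for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_hinge_fit(X, y, m, lam=1e-3, sigma=1.0, steps=500, seed=0):
    """Hinge-loss ERM restricted to the span of m randomly chosen data points.

    The hypothesis is f(x) = sum_j alpha_j k(x, c_j), where the centers c_j are
    a uniform random subset of the training inputs (an assumption of this
    sketch). Labels y are expected in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)   # uniform Nystrom subsampling
    centers = X[idx]
    K_nm = gaussian_kernel(X, centers, sigma)    # n x m, evaluates f on the data
    K_mm = gaussian_kernel(centers, centers, sigma)  # m x m, gives the squared norm
    alpha = np.zeros(m)
    for t in range(1, steps + 1):
        margins = y * (K_nm @ alpha)
        active = margins < 1.0                   # points with nonzero hinge loss
        # Subgradient of (1/n) sum_i max(0, 1 - y_i f(x_i)) + lam * alpha^T K_mm alpha
        grad = -(K_nm[active].T @ y[active]) / n + 2.0 * lam * (K_mm @ alpha)
        alpha -= grad / np.sqrt(t)               # decaying step size (a simple choice)
    return centers, alpha

def nystrom_predict(X_new, centers, alpha, sigma=1.0):
    """Real-valued scores; take the sign for a class prediction."""
    return gaussian_kernel(X_new, centers, sigma) @ alpha
```

Only the $n \times m$ and $m \times m$ kernel blocks are formed, never the full $n \times n$ kernel matrix, which is where the computational saving of the random subspace comes from. Swapping the hinge subgradient for the gradient of the square loss gives the other case compared in the abstract.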

Paper Structure

This paper contains 40 sections, 31 theorems, 239 equations, 3 figures, 8 tables.

Key Result

Theorem 1

Under Assumptions ass: sub-gaussian and ass:loss, fix $\lambda>0$ and $0<\delta<1$. Then, with probability at least $1- \delta$, where $C$ and $G$ are the constants defined respectively in def: subgauss and eq:5, $D$ is a numerical constant and

Figures (3)

  • Figure 1: Comparison of the number of Nyström points needed by the square and hinge losses to attain a fixed common rate. The plots show $\mu_{\text{square}}-\mu_{\text{hinge}}$, where $0\leqslant\mu\leqslant1$ is the exponent controlling $m$, i.e. $m\asymp n^{\mu}$. Light colours indicate the regimes where the hinge loss is cheaper than the square loss.
  • Figure 2: Results on the SUSY dataset. Top left: how the c-err measure changes with the choice of the parameter $\lambda$. Top right: stability of the algorithm as $\lambda$ varies. Bottom: the combined behavior, shown as a heatmap.
  • Figure 3: Eigenvalue decay of the empirical covariance matrix for the binary MNIST, CIFAR and SUSY datasets.
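The two quantities behind Figures 1 and 3 can be computed with a few lines of NumPy. The sketch below is illustrative only: it assumes the empirical covariance is the uncentered second-moment matrix $\frac{1}{n}X^\top X$, rounds $n^{\mu}$ up to an integer, and ignores the constants hidden in $m\asymp n^{\mu}$. The function names `covariance_spectrum` and `nystrom_points` are introduced here for the example.

```python
import numpy as np

def covariance_spectrum(X):
    """Eigenvalues of the empirical covariance (1/n) X^T X, in decreasing order.

    Computed from the singular values of X, since the eigenvalues of X^T X
    are the squared singular values. Uncentered by assumption.
    """
    n = X.shape[0]
    s = np.linalg.svd(X, compute_uv=False)   # returned in decreasing order
    return s**2 / n

def nystrom_points(n, mu):
    """Number of Nystrom points m = ceil(n^mu) for an exponent 0 <= mu <= 1."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return int(np.ceil(n**mu))
```

A faster eigenvalue decay generally means that fewer Nyström points capture most of the spectrum; for example, with $n = 10^6$ samples an exponent of $\mu = 0.5$ corresponds to roughly $10^3$ points, up to constants.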

Theorems & Definitions (44)

  • Example 1
  • Example 2
  • Example 3: Representer theorem for kernel machines
  • Example 4: Hinge loss & SVM
  • Theorem 1
  • Theorem 2
  • Remark 3
  • Definition 4: Approximate leverage scores sampling (ALS)
  • Remark 5
  • Example 5: Kernel methods and Nyström approximations
  • ...and 34 more