Iterative regularization in classification via hinge loss diagonal descent

Vassilis Apidopoulos, Tomaso Poggio, Lorenzo Rosasco, Silvia Villa

TL;DR

This paper develops an iterative regularization approach to classification based on the hinge loss. For a family of diagonal algorithms, it proves convergence, rates of convergence, and stability results under a suitable classification noise model.

Abstract

Iterative regularization is a classic idea in regularization theory that has recently become popular in machine learning. On the one hand, it allows one to design efficient algorithms that control numerical and statistical accuracy at the same time. On the other hand, it sheds light on the learning curves observed while training neural networks. In this paper, we focus on iterative regularization in the context of classification. After contrasting this setting with that of linear inverse problems, we develop an iterative regularization approach based on the hinge loss function. More precisely, we consider a diagonal approach for a family of algorithms, for which we prove convergence as well as rates of convergence and stability results for a suitable classification noise model. Our approach compares favorably with other alternatives, as confirmed by numerical simulations.

Paper Structure (23 sections, 15 theorems, 148 equations, 6 figures, 2 algorithms)

Key Result

Lemma 3.1

Problem max_sphere is equivalent to Problem min_norm. In particular, if $w_\ast$ is a solution of Problem min_norm, then $w_{+}=\frac{w_{\ast}}{\left\lVert {w_{\ast}} \right\rVert}$ is a solution of Problem max_sphere and $M(w_{+})=\frac{1}{\left\lVert {w_{\ast}} \right\rVert}$. Moreover, if $w_+$ is the solution of Problem max_sphere, then $w_{\ast}=\frac{w_{+}}{M(w_{+})}$ is a solution of Problem min_norm. Further, it holds that $M(w_{\ast})=1$.
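
The identities in Lemma 3.1 can be checked numerically on a small example. The sketch below is illustrative only: the four data points are hypothetical (not taken from the paper), the margin is assumed to be $M(w)=\min_i y_i\langle w,x_i\rangle$, and the two solutions are assumed to be related by the usual normalizations $w_+=w_\ast/\lVert w_\ast\rVert$ and $w_\ast=w_+/M(w_+)$.

```python
import math

# Hypothetical toy data (not from the paper): pairs (x, y) with labels y in {-1, +1}.
DATA = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((3.0, 1.0), 1), ((-4.0, -2.0), -1)]


def margin(w):
    """Assumed margin: M(w) = min_i y_i * <w, x_i>."""
    return min(y * (w[0] * x[0] + w[1] * x[1]) for x, y in DATA)


def norm(w):
    """Euclidean norm of a 2-d vector."""
    return math.hypot(w[0], w[1])


# Min-norm solution for this data: the constraint y_1 <w, x_1> >= 1 forces
# w[0] >= 1/2, and w = (1/2, 0) satisfies every other constraint as well.
w_star = (0.5, 0.0)

# Assumed normalization onto the unit sphere, and its inverse via the margin.
w_plus = tuple(c / norm(w_star) for c in w_star)
w_back = tuple(c / margin(w_plus) for c in w_plus)

# Sanity check: no unit vector on a fine grid of angles has a larger margin.
angles = (2 * math.pi * k / 3600 for k in range(3600))
best_on_grid = max(margin((math.cos(a), math.sin(a))) for a in angles)

print("M(w_*) =", margin(w_star))
print("M(w_+) =", margin(w_plus), " 1/||w_*|| =", 1 / norm(w_star))
print("w_+ / M(w_+) =", w_back)
```

The toy data is chosen so that the min-norm solution can be read off by hand, which keeps the check free of any solver; for real data, `w_star` would have to come from a hard-margin SVM solver.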

Figures (6)

  • Figure 1: The exponential (blue), logistic (red) and hinge (green) loss function.
  • Figure 2: Data set consisting of $80$ labeled points with given support vectors $\pm(\frac{1}{2},\frac{3}{2})$ and $\pm(\frac{3}{2},\frac{1}{2})$. The dashed lines show the (overlapping) max-margin separating hyperplanes given by the last iterate of each scheme (Algorithms \ref{algodualprojGD} and \ref{algodualinertialGD} with $\alpha=10$, $30$ and $50$ respectively).
  • Figure 3: Values of the normalized error gap $\lvert{{w_{t}}/{\left\lVert {w_{t}} \right\rVert}-{w_{\ast}}/{\left\lVert {w_{\ast}} \right\rVert}}\rvert$ (first figure), the normalized margin gap ${M(w_{\ast})}/{\left\lVert {w_{\ast}} \right\rVert}-{M(w_{t})}/{\left\lVert {w_{t}} \right\rVert}$ (second figure) and the normalized angle gap $1-\frac{\left\langle{w_{t}},{w_{\ast}}\right\rangle}{\left\lVert {w_{t}} \right\rVert\left\lVert {w_{\ast}} \right\rVert}$ (third figure), as functions of the iteration $t$. The plots show the performance of Algorithm \ref{algodualprojGD} (green) and of Algorithm \ref{algodualinertialGD} with three choices of the parameter $\alpha$: $\alpha=10$ (red), $\alpha=30$ (magenta) and $\alpha=50$ (blue). As this example shows, the choice of $\alpha$ decisively affects the performance of Algorithm \ref{algodualinertialGD}.
  • Figure 4: Margin and test error performance of Algorithms \ref{algodualprojGD} and \ref{algodualinertialGD} on a noisy data set with $\lambda_{t}={\lambda_{0}}/{t}$, for different initial values of $\lambda_{0}$ ($\lambda_{0}=100$ in green, $\lambda_{0}=10$ in red, $\lambda_{0}=1$ in blue and $\lambda_{0}=0.01$ in magenta). Each row corresponds to a different noise level ($\%$ of flipped labels): $0\%$ (first row), $10\%$ (second row) and $20\%$ (third row). The first and third columns show the margin and test error of Algorithm \ref{algodualprojGD} respectively, while the second and fourth columns show the margin and test error of Algorithm \ref{algodualinertialGD}.
  • Figure 5: Margin and test error performance of Algorithms \ref{algodualprojGD} and \ref{algodualinertialGD} on a noisy data set for different decay rates of $\lambda_{t}$ ($\lambda_{t}=\frac{\lambda_{0}}{\log(t)}$ in green, $\lambda_{t}=\frac{\lambda_{0}}{\sqrt{t}}$ in red, $\lambda_{t}=\frac{\lambda_{0}}{t}$ in blue, $\lambda_{t}=\frac{\lambda_{0}}{t^2}$ in magenta and $\lambda_{t}=\frac{\lambda_{0}}{2^{t}}$ in khaki), where $\lambda_{0}=8$. Each row corresponds to a different noise level ($\%$ of flipped labels): $0\%$ (first row), $10\%$ (second row) and $20\%$ (third row). The first and third columns show the margin and test error of Algorithm \ref{algodualprojGD} respectively, while the second and fourth columns show the margin and test error of Algorithm \ref{algodualinertialGD}.
  • ...and 1 more figures
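
To get a feel for how differently the five decay rates in the Figure 5 caption behave, the schedules can be tabulated directly. This is a minimal sketch: only the formulas and $\lambda_0=8$ come from the caption; the schedule names, the helper `tabulate`, and the sample iterations are illustrative choices. The table starts at $t=2$ because $\lambda_0/\log(t)$ is undefined at $t=1$.

```python
import math

LAMBDA_0 = 8.0  # initial value stated in the Figure 5 caption

# The five decay rates listed in the caption; the dictionary keys are illustrative names.
SCHEDULES = {
    "log": lambda t: LAMBDA_0 / math.log(t),
    "sqrt": lambda t: LAMBDA_0 / math.sqrt(t),
    "linear": lambda t: LAMBDA_0 / t,
    "quadratic": lambda t: LAMBDA_0 / t**2,
    "exponential": lambda t: LAMBDA_0 / 2**t,
}


def tabulate(steps):
    """Evaluate every schedule at the given iteration counts (hypothetical helper)."""
    return {name: [schedule(t) for t in steps] for name, schedule in SCHEDULES.items()}


# Sample iterations chosen for illustration; t = 1 is skipped since log(1) = 0.
table = tabulate([2, 10, 100])
for name, values in table.items():
    print(f"{name:12s}", ["%.3g" % v for v in values])
```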

Theorems & Definitions (38)

  • Remark 2.1: Regression in RKHS (steinwart2008support)
  • Remark 2.2: Loss and regularizers
  • Lemma 3.1
  • Remark 4.1
  • Remark 4.2: Implicit regularization via homotopic subgradient
  • Remark 4.3: Hard-SVM
  • Theorem 5.1
  • Theorem 5.2
  • Remark 5.1
  • Remark 5.2: Error metrics
  • ...and 28 more