A Smoothing Algorithm for l1 Support Vector Machines

Ibrahim Emirahmetoglu; Jeffrey Hajewski; Suely Oliveira; David E. Stewart

TL;DR

The paper introduces SmSVM, a method for efficiently solving soft-margin SVMs with an ℓ¹ penalty on very large datasets by smoothing the hinge loss and employing an active-set strategy for the ℓ¹ term. The approach yields well-behaved Hessian approximations and a Newton-based solver with guarded line search, achieving a provably bounded number of Newton steps per smoothing parameter reduction and overall data passes that scale polylogarithmically with α. Theoretical analysis across opening, midgame, and endgame regimes provides convergence guarantees and practical guidance, while extensive experiments on real and synthetic data demonstrate competitive test accuracy and favorable training times, especially in tall and large-scale settings. The work shows how combining smoothing, active-set sparsity, and second-order optimization can robustly handle large-scale, sparse SVMs with strong empirical performance. It highlights potential for further speedups via GPU acceleration and distributed computing, making ℓ¹ SVMs more viable for industrial-scale data.
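For orientation, the problem class described above can be written in a standard form. This is a sketch under assumptions, not the paper's own notation: the $1/N$ normalization, the penalty weight $\lambda$, the omission of a bias term, and the particular smoothing function $\psi_{\alpha}$ are illustrative choices that do not appear in this summary.

$$
\min_{\bm{w}}\; \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0,\,1-y_{i}\,\bm{w}^{T}\bm{x}_{i}\bigr)+\lambda\left\Vert \bm{w}\right\Vert _{1}
\quad\longrightarrow\quad
\min_{\bm{w}}\; \frac{1}{N}\sum_{i=1}^{N}\psi_{\alpha}\bigl(1-y_{i}\,\bm{w}^{T}\bm{x}_{i}\bigr)+\lambda\left\Vert \bm{w}\right\Vert _{1},
$$

where one possible smooth surrogate for $\max(0,u)$ is $\psi_{\alpha}(u)=\tfrac{1}{2}\bigl(u+\sqrt{u^{2}+\alpha^{2}}\bigr)$, which satisfies $0<\psi_{\alpha}(u)-\max(0,u)\le\alpha/2$ and recovers the hinge loss as $\alpha\to0$. The $\ell^{1}$ term is left unsmoothed, consistent with it being handled by an active-set strategy.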

Abstract

A smoothing algorithm is presented for solving the soft-margin Support Vector Machine (SVM) optimization problem with an $\ell^{1}$ penalty. This algorithm is designed to require a modest number of passes over the data, which is an important measure of its cost for very large datasets. The algorithm uses smoothing for the hinge-loss function and an active-set approach for the $\ell^{1}$ penalty. The smoothing parameter $\alpha$ is initially large and is typically halved once the smoothed problem has been solved to sufficient accuracy. Convergence theory is presented that shows $\mathcal{O}(1+\log(1+\log_+(1/\alpha)))$ guarded Newton steps for each value of $\alpha$, except for the asymptotic bands $\alpha=\Theta(1)$ and $\alpha=\Theta(1/N)$, with only one Newton step needed provided $\eta\alpha\gg1/N$. Here $N$ is the number of data points, and the stopping criterion is that the predicted reduction is less than $\eta\alpha$. The experimental results show that our algorithm is capable of strong test accuracy without sacrificing training speed.
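To make the role of the smoothing parameter concrete, the following is a minimal sketch of a smoothed hinge loss and an $\alpha$-halving schedule. It rests on assumptions: the smoothing function $(u+\sqrt{u^{2}+\alpha^{2}})/2$, the function names, and the $1/N$ scaling of the loss are illustrative and are not taken from the paper. The sketch only evaluates the smoothed objective; it does not implement the guarded Newton solver or the active-set handling of the $\ell^{1}$ term.

```python
import numpy as np


def hinge(u):
    """Exact hinge-type function max(0, u)."""
    return np.maximum(0.0, u)


def smoothed_hinge(u, alpha):
    """Assumed smooth surrogate for max(0, u) (hypothetical choice, not the paper's)."""
    return 0.5 * (u + np.sqrt(u * u + alpha * alpha))


def smoothed_hinge_d2(u, alpha):
    """Second derivative of the surrogate; strictly positive for alpha > 0."""
    return 0.5 * alpha**2 / (u * u + alpha * alpha) ** 1.5


def smoothed_l1_objective(w, X, y, lam, alpha):
    """Mean smoothed hinge loss plus an unsmoothed l1 penalty (assumed scaling)."""
    margins = 1.0 - y * (X @ w)
    return smoothed_hinge(margins, alpha).mean() + lam * np.abs(w).sum()


# Halve alpha repeatedly and report the worst-case gap to the exact hinge
# on a grid; for this surrogate the gap is largest at u = 0.
u = np.linspace(-3.0, 3.0, 601)
alpha = 1.0
for _ in range(5):
    gap = np.max(smoothed_hinge(u, alpha) - hinge(u))
    print(f"alpha={alpha:.4f}  max gap to hinge={gap:.4f}")
    alpha *= 0.5
```

For this particular surrogate the second derivative is positive everywhere when $\alpha>0$, which is one way a smoothed loss can supply usable curvature information for a Newton-type method, whereas the exact hinge loss has zero curvature away from its kink.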

Paper Structure (28 sections, 8 theorems, 37 equations, 1 figure, 3 tables, 5 algorithms)

Introduction
Algorithm development
Smoothing the hinge-loss function and Hessian matrices
Smoothing algorithm
Issues with the line search
Convergence rate
Assumptions
Opening
Midgame
Lipschitz continuity of $\mathbb{E}[\mathop{\mathrm{Hess}}\nolimits \widehat{f}_{\alpha}(\bm{w};\mathcal{D})]$ in $\bm{w}$
Lipschitz constants for $\mathop{\mathrm{Hess}}\nolimits\widehat{f}_{\alpha}(\bm{w};\mathcal{D})$
Bound $0\prec aI\preceq\left\Vert \bm{w}\right\Vert \mathbb{E}\left[\mathop{\mathrm{Hess}}\nolimits\widehat{f}_{\alpha}(\bm{w};\mathcal{D})\right]\preceq bI$ for all sufficiently small $\alpha>0$
Variation of $\mathop{\mathrm{Hess}}\nolimits\widehat{f}_{\alpha}(\bm{w};\mathcal{D})$
Variance of gradients
Effects of changing $\alpha$
...and 13 more sections

Key Result

Lemma 3.1

Assuming that $\bm{x}\mapsto p(\bm{x},y)$ is Lipschitz, for any compact set $C$ not containing $\bm{0}$, there is a constant $L$ independent of $\alpha$ such that

Figures (1)

Figure 1: Plot of $N^{-1}\sum_{i=1}^{N}\varphi(x_{i},y_{i};w)$ for randomly chosen data ($n=200$)

Theorems & Definitions (8)

Lemma 3.1
Lemma 3.2
Lemma 3.3
Lemma 3.4
Lemma 3.5
Lemma 3.6
Lemma 3.7
Lemma 3.8
