Data-Dependent Complexity of First-Order Methods for Binary Classification

Matthew Hough, Stephen A. Vavasis

TL;DR

The paper investigates data-dependent iteration complexity for binary classification tasks solved with first-order methods. It develops a FISTA-based approach for two problems, the Ellipsoid Separation Problem (ESP) and the soft-margin SVM, and derives data-driven stopping criteria that rely on geometric properties of the data rather than worst-case algorithmic constants. For the ESP, a dual SOCP formulation lets FISTA yield a separating hyperplane via its residual, with an explicit iteration upper bound scaling as $\mathcal{O}(1/\delta_{\mathcal{A}}^2)$, where $\delta_{\mathcal{A}}$ is the minimal perturbation that destroys separability. For the SVM, a strongly concave perturbed dual has a unique maximizer and enables efficient identification of well-classified points and of a hyperplane separating them; empirical results show competitive runtimes against LIBSVM and LIBLINEAR and speedups from early stopping. Overall, the work demonstrates practical, data-dependent stopping rules that accelerate large-scale binary classification while providing theoretical guarantees tied to data geometry.

Abstract

Large-scale problems in data science are often modeled with optimization, and the optimization model is usually solved with first-order methods that may converge at a sublinear rate. Therefore, it is of interest to terminate the optimization algorithm as soon as the underlying data science task is accomplished. We consider FISTA for solving two binary classification problems: the ellipsoid separation problem (ESP), and the soft-margin support-vector machine (SVM). For the ESP, we cast the dual second-order cone program into a form amenable to FISTA and show that the FISTA residual converges to the infimal displacement vector of the primal-dual hybrid gradient (PDHG) algorithm, which directly encodes a separating hyperplane. We further derive a data-dependent iteration upper bound scaling as $\mathcal{O}(1/\delta_{\mathcal{A}}^2)$, where $\delta_{\mathcal{A}}$ is the minimal perturbation that destroys separability. For the SVM, we propose a strongly-concave perturbed dual that admits efficient FISTA updates under a linear-time projection scheme, and with our parameter choices, the objective has a small condition number, enabling rapid convergence. We prove that, under a reasonable data model, early-stopped iterates identify well-classified points and yield a hyperplane that exactly separates them, where the accuracy required of the dual iterate is governed by geometric properties of the data. In particular, the proposed early-stopping criteria diminish the need for hard-to-select tolerance-based stopping conditions. Our numerical experiments on ESP instances derived from MNIST data and on soft-margin SVM benchmarks indicate competitive runtimes and substantial speedups from stopping early.

Paper Structure

This paper contains 27 sections, 30 theorems, 162 equations, 4 figures, 3 tables.

Key Result

Lemma 2.1

Consider the ellipsoid $E:=\{\bm{z}:\Vert A^{-1}(\bm{z}-\bm{c})\Vert\le 1\}$, where $A \in {\mathbb{R}}^{d\times d}, \bm{c} \in {\mathbb{R}}^d$, and $s \in {\mathbb{R}}$. Let $\bm{w} \in {\mathbb{R}}^d\setminus\{\bm{0}\}$. Then $E$ lies in a halfspace $H:=\{\bm{z}:\bm{w}^T\bm{z}\le -s\}$ if and only
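The lemma statement is cut off in this listing, so the following is only a sketch of the standard support-function test for the same containment question, not necessarily the form the paper states. It assumes $A$ is invertible, so that $E = \{\bm{c} + A\bm{u} : \Vert\bm{u}\Vert \le 1\}$ and the maximum of $\bm{w}^T\bm{z}$ over $E$ equals $\bm{w}^T\bm{c} + \Vert A^T\bm{w}\Vert$. The function names `support_value` and `ellipsoid_in_halfspace` are hypothetical.

```python
import math

def support_value(A, c, w):
    """Maximum of w^T z over E = {c + A u : ||u|| <= 1}, i.e. w^T c + ||A^T w||."""
    d = len(c)
    # (A^T w)_j = sum_i A[i][j] * w[i]
    At_w = [sum(A[i][j] * w[i] for i in range(d)) for j in range(d)]
    return sum(wi * ci for wi, ci in zip(w, c)) + math.sqrt(sum(v * v for v in At_w))

def ellipsoid_in_halfspace(A, c, w, s, tol=1e-12):
    """True if E lies in H = {z : w^T z <= -s} (standard test, assumed here)."""
    return support_value(A, c, w) <= -s + tol

# Axis-aligned ellipse with semi-axes 2 and 1, centered at the origin.
A = [[2.0, 0.0], [0.0, 1.0]]
c = [0.0, 0.0]
w = [1.0, 0.0]
print(ellipsoid_in_halfspace(A, c, w, -2.0))  # True: halfspace z_1 <= 2 touches the ellipse
print(ellipsoid_in_halfspace(A, c, w, -1.5))  # False: halfspace z_1 <= 1.5 cuts it
```

The check costs one matrix-vector product, which is why a candidate hyperplane can be verified cheaply at any iterate.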

Figures (4)

  • Figure 1: An illustration of the assumption on the data, where we have two balls of well-classified points centered at $-1$ and $+1$, and a larger ball containing the two, which also includes the noisy points.
  • Figure 2: A plot of $\psi(\theta)$, where $n=100$, $\gamma = 64/n, \mu = n/128$.
  • Figure 3: Number of iterations before obtaining a separating hyperplane versus the distance between the centers of two ellipsoids in two dimensions. The vertical red dashed line marks the first distance at which the ESP could be solved.
  • Figure 4: Plots depicting $\lVert\bm{y}_k - \mathrm{proj}_{\Omega}(\bm{y}_k - \frac{1}{L}\nabla g(\bm{y}_k))\rVert$ for $k \in [0,10000]$.
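The quantity plotted in Figure 4, $\lVert\bm{y}_k - \mathrm{proj}_{\Omega}(\bm{y}_k - \frac{1}{L}\nabla g(\bm{y}_k))\rVert$, is the projected-gradient residual at the extrapolated FISTA point. The sketch below shows how such a residual can be computed and used as a stopping test. It is an illustration on a toy problem, not the paper's ESP or SVM formulation: the objective (a two-variable convex quadratic), the set $\Omega = [0,1]^2$, the tolerance, and all function names are assumptions.

```python
import math

def fista_box(grad, L, y0, lo=0.0, hi=1.0, tol=1e-8, max_iter=10000):
    """Projected FISTA on the box [lo, hi]^n; returns (x, residual, iterations)."""
    proj = lambda v: [min(max(vi, lo), hi) for vi in v]
    x_prev = list(y0)
    y = list(y0)
    t = 1.0
    residual = float("inf")
    for k in range(1, max_iter + 1):
        g = grad(y)
        # Projected gradient step from the extrapolated point y_k.
        x = proj([yi - gi / L for yi, gi in zip(y, g)])
        # Residual ||y_k - proj(y_k - grad(y_k)/L)||, as in the Figure 4 caption.
        residual = math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)))
        if residual <= tol:
            return x, residual, k
        # Standard FISTA momentum update.
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev, residual, max_iter

# Toy objective g(y) = 0.5 * y^T Q y - b^T y with Q = [[2, 1], [1, 2]], b = (1, -1).
# The largest eigenvalue of Q is 3, so L = 3 is a valid Lipschitz constant.
def grad_g(y):
    return [2.0 * y[0] + y[1] - 1.0, y[0] + 2.0 * y[1] + 1.0]

x, res, iters = fista_box(grad_g, L=3.0, y0=[0.0, 0.0])
print(x, res, iters)  # x is close to the constrained minimizer (0.5, 0)
```

A fixed tolerance on this residual is the kind of stopping condition the abstract describes as hard to select; the sketch is meant only to make the plotted quantity concrete.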

Theorems & Definitions (56)

  • Lemma 2.1
  • proof
  • Theorem 2.1
  • proof
  • Lemma 2.2: Jiang2023
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof
  • Corollary 2.1
  • ...and 46 more