Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization

Zikai Xiong, Robert M. Freund

TL;DR

This work addresses the challenge of large-scale ERM where the cost of exact gradient evaluation scales with the number of observations $n$. It introduces TUFW, a Taylor-point Updating Frank-Wolfe method that replaces full gradients with second-order Taylor-approximated gradients around per-example Taylor points, coupled with stochastic or deterministic batch-update rules to control gradient computations. The authors prove convergence guarantees for both convex and non-convex losses, derive specialized rates for ERM with linear prediction, and develop an adaptive step-size variant with competitive performance. Computational experiments on real datasets demonstrate substantial speed-ups over existing Frank-Wolfe methods, highlighting the practical impact of reducing $n$-dependence while preserving or approaching optimal convergence rates.

Abstract

The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
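To make the phrase "Taylor series-approximated gradients" concrete, the block below writes out one natural form of the approximation. The notation is an assumption made for illustration and is not taken from the paper: $F$ is the empirical risk with per-example losses $f_i$, $b_i$ is the Taylor point currently stored for example $i$, and $L_H$ is a Lipschitz constant for the Hessians $\nabla^2 f_i$. The bound in the last line is the standard Taylor remainder estimate under that Lipschitz assumption.

```latex
% Empirical risk and its exact gradient (cost grows linearly in n):
F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad
\nabla F(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x).

% Gradient of the second-order Taylor model of each f_i around its own point b_i:
g(x) = \frac{1}{n} \sum_{i=1}^{n}
       \Bigl[ \nabla f_i(b_i) + \nabla^2 f_i(b_i)\,(x - b_i) \Bigr].

% If each \nabla^2 f_i is L_H-Lipschitz, the approximation error is second order
% in the distance between x and the Taylor points:
\| g(x) - \nabla F(x) \|
  \le \frac{L_H}{2n} \sum_{i=1}^{n} \| x - b_i \|^2 .
```

Under these assumptions the error is small whenever the Taylor points stay close to the current iterate, which suggests why only a batch of them needs refreshing at each iteration.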

Paper Structure (19 sections, 28 theorems, 95 equations, 2 figures, 4 tables, 1 algorithm)


Key Result

Proposition 2.1

Let $\bar{\beta}_k := |\mathcal{B}_k|$. Then the updating of the second-order Taylor approximation model requires $O(\bar{\beta}_{k} p^2)$ flops and the computation of $g^k$ can be done via a matrix-vector product, which takes at most $O(p^2)$ flops.
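The flop counts in this proposition can be illustrated with a minimal sketch. It assumes ERM with linear prediction and logistic loss, $f_i(x) = \log(1 + \exp(-y_i w_i^\top x))$, and it assumes the Taylor model is stored as running aggregates $H$ and $c$ so that $g = Hx + c$. The class name `TaylorGradientModel`, its methods, and the helper functions are hypothetical names chosen for illustration, not the authors' implementation. Refreshing the Taylor points in a batch of size $\bar{\beta}_k$ touches only rank-one terms $w_i w_i^\top$, which costs $O(\bar{\beta}_k p^2)$, and evaluating $g$ is a single matrix-vector product.

```python
import numpy as np


def logistic_derivs(t, y):
    """First and second derivatives in t of log(1 + exp(-y * t)), y in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * t))  # equals sigmoid(-y * t)
    return -y * s, s * (1.0 - s)


def exact_gradient(W, y, x):
    """Exact ERM gradient; costs O(n p) flops per call."""
    d1, _ = logistic_derivs(W @ x, y)
    return W.T @ d1 / W.shape[0]


class TaylorGradientModel:
    """Hypothetical container for the aggregates H, c with g(x) = H @ x + c."""

    def __init__(self, W, y, x0):
        self.W, self.y = W, y
        n, p = W.shape
        self.H = np.zeros((p, p))  # (1/n) * sum_i l_i''(w_i^T b_i) w_i w_i^T
        self.c = np.zeros(p)       # (1/n) * sum_i [l_i' - l_i'' * w_i^T b_i] w_i
        self.h = np.zeros(n)       # stored l_i'' at each Taylor point
        self.r = np.zeros(n)       # stored l_i' - l_i'' * (w_i^T b_i)
        self.update(np.arange(n), x0)  # initialize every Taylor point at x0

    def update(self, batch, x):
        """Move the Taylor points of the (unique) indices in `batch` to x.

        Only |batch| rank-one terms change, so the cost is O(|batch| p^2).
        """
        n = self.W.shape[0]
        Wb = self.W[batch]
        t = Wb @ x
        d1, d2 = logistic_derivs(t, self.y[batch])
        r_new = d1 - d2 * t
        # Swap old contributions for new ones in the aggregates.
        self.H += (Wb.T * (d2 - self.h[batch])) @ Wb / n
        self.c += Wb.T @ (r_new - self.r[batch]) / n
        self.h[batch] = d2
        self.r[batch] = r_new

    def gradient(self, x):
        """Taylor-approximated gradient: one matrix-vector product, O(p^2)."""
        return self.H @ x + self.c
```

When every Taylor point coincides with the query point, `gradient` returns the exact gradient; between refreshes it returns the gradient of the stored quadratic model, independent of $n$ in cost.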

Figures (2)

  • Figure 1: Performance of TUFW with $\mathit{Rule\text{-}SBD}\sqrt{k}$ and $\mathit{Rule\text{-}DBD}\sqrt{k}$ and the standard Frank-Wolfe method, with and without adaptive step-sizes, on the logistic regression problem, on the dataset a9a.
  • Figure 2: Performance of TUFW with $\mathit{Rule\text{-}SBD}\sqrt[4]{K}$ and $\mathit{Rule\text{-}DBD}\sqrt[4]{K}$ and the standard Frank-Wolfe method, with and without adaptive step-sizes, on the binary classification problem, on the dataset a9a.

Theorems & Definitions (55)

  • Proposition 2.1
  • Proof 1: Proof of Proposition 2.1
  • Definition 3.1
  • Proposition 3.2
  • Proof 2: Proof of Proposition 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Definition 3.5
  • Proposition 3.6
  • Proof 3: Proof of Proposition 3.6
  • ...and 45 more