Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization

Zikai Xiong; Robert M. Freund

TL;DR

This work addresses the challenge of large-scale ERM where the cost of exact gradient evaluation scales with the number of observations $n$. It introduces TUFW, a Taylor-point Updating Frank-Wolfe method that replaces full gradients with second-order Taylor-approximated gradients around per-example Taylor points, coupled with stochastic or deterministic batch-update rules to control gradient computations. The authors prove convergence guarantees for both convex and non-convex losses, derive specialized rates for ERM with linear prediction, and develop an adaptive step-size variant with competitive performance. Computational experiments on real datasets demonstrate substantial speed-ups over existing Frank-Wolfe methods, highlighting the practical impact of reducing $n$-dependence while preserving or approaching optimal convergence rates.
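To make "Taylor-approximated gradient" concrete, here is a sketch in assumed notation: the symbols $f_i$ (loss on example $i$), $b_i$ (Taylor point of example $i$), and $g(x)$ are illustrative and may differ from the paper's. The ERM objective is an average of per-example losses, and each per-example gradient is linearized around its own Taylor point:

```latex
F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
g(x) = \frac{1}{n}\sum_{i=1}^{n}\Bigl[\nabla f_i(b_i) + \nabla^2 f_i(b_i)\,(x - b_i)\Bigr]
\;\approx\; \nabla F(x).
```

If each Hessian $\nabla^2 f_i$ is Lipschitz with constant $L_2$ (an assumption of this sketch), the error of each term is at most $\frac{L_2}{2}\|x - b_i\|^2$, so $g(x)$ stays accurate as long as the Taylor points are refreshed often enough to remain close to the current iterate.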

Abstract

The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
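As background for the abstract, the following is a minimal sketch of the standard Frank-Wolfe method (not TUFW itself) over an $\ell_1$ ball, a feasible set where linear minimization is cheaper than projection. The function names `lmo_l1_ball` and `frank_wolfe`, the choice of feasible set, and the step-size $2/(k+2)$ are assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def lmo_l1_ball(grad: Vector, radius: float) -> Vector:
    """Linear minimization oracle: a minimizer of <grad, s> over the l1 ball.

    The minimizer is a signed, scaled coordinate vector, found in O(p) time.
    """
    j = max(range(len(grad)), key=lambda i: abs(grad[i]))
    s = [0.0] * len(grad)
    if grad[j] != 0.0:
        s[j] = -radius if grad[j] > 0 else radius
    return s


def frank_wolfe(
    grad_fn: Callable[[Vector], Vector],
    x0: Vector,
    radius: float,
    num_iters: int,
) -> Vector:
    """Standard Frank-Wolfe with the open-loop step-size 2 / (k + 2)."""
    x = list(x0)
    for k in range(num_iters):
        s = lmo_l1_ball(grad_fn(x), radius)  # one gradient call per iteration
        gamma = 2.0 / (k + 2)
        # Convex combination keeps the iterate inside the feasible set.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x
```

Each iteration calls `grad_fn` once. For ERM with an exact gradient, that call touches all $n$ observations, which is the linear dependence on $n$ that the abstract refers to; an approximate gradient can be passed as `grad_fn` without changing the rest of the loop.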

Paper Structure (19 sections, 28 theorems, 95 equations, 2 figures, 4 tables, 1 algorithm)

Introduction
Examples of oft-used loss functions with smooth second derivatives
Quadratic losses
Logistic regression for binary classification
Binary linear classification with non-convex losses (Mei et al., 2018)
Loss functions with "linear prediction"
Motivation and literature review
Contributions
Outline
Notation
A Frank-Wolfe method with Taylor-point Updating (TUFW)
Convergence Guarantees for Convex Loss Functions
Proofs of results
Convergence Guarantees for Non-convex Loss Functions
Proofs of results
...and 4 more sections

Key Result

Proposition 2.1

Let $\bar{\beta}_k := |\mathcal{B}_k|$. Then the updating of the second-order Taylor approximation model requires $O(\bar{\beta}_{k} p^2)$ flops, and the computation of $g^k$ can be done via a matrix-vector product, which takes at most $O(p^2)$ flops.
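The flop counts in this proposition can be illustrated with a small sketch. It assumes losses with linear prediction, $f_i(x) = \ell(w_i^\top x)$ with the logistic loss $\ell(z) = \log(1 + e^{-z})$, so that every per-example Hessian is rank-one. The class `TaylorGradientModel`, its fields `H`, `c`, `z`, and the helper `exact_gradient` are hypothetical names; the paper's own bookkeeping may differ.

```python
import math
from typing import Iterable, List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


class TaylorGradientModel:
    """Aggregated Taylor model g(x) = H x + c for f_i(x) = l(w_i . x).

    Assumes l(z) = log(1 + exp(-z)); not numerically hardened for large |z|.
    """

    def __init__(self, W: List[Vector], x0: Vector) -> None:
        self.W, self.n, self.p = W, len(W), len(x0)
        self.H = [[0.0] * self.p for _ in range(self.p)]
        self.c = [0.0] * self.p
        # z[i] = w_i . b_i is all we need to remember about Taylor point b_i.
        self.z = [_dot(w, x0) for w in W]
        for i in range(self.n):
            self._add(i, +1.0)

    def _add(self, i: int, sign: float) -> None:
        """Add (sign=+1) or remove (sign=-1) example i: a rank-one O(p^2) step."""
        w, z = self.W[i], self.z[i]
        d1 = -1.0 / (1.0 + math.exp(z))      # l'(z)
        sig = 1.0 / (1.0 + math.exp(-z))
        d2 = sig * (1.0 - sig)               # l''(z)
        a = sign * d2 / self.n
        b = sign * (d1 - d2 * z) / self.n
        for r in range(self.p):
            self.c[r] += b * w[r]
            ar = a * w[r]
            for s in range(self.p):
                self.H[r][s] += ar * w[s]

    def update(self, batch: Iterable[int], x: Vector) -> None:
        """Move the Taylor points of `batch` to x: O(|batch| * p^2) flops."""
        for i in batch:
            self._add(i, -1.0)
            self.z[i] = _dot(self.W[i], x)
            self._add(i, +1.0)

    def gradient(self, x: Vector) -> Vector:
        """Taylor-approximated gradient: one matrix-vector product, O(p^2)."""
        return [_dot(row, x) + ci for row, ci in zip(self.H, self.c)]


def exact_gradient(W: List[Vector], x: Vector) -> Vector:
    """Exact gradient of (1/n) * sum_i l(w_i . x), for comparison: O(n p)."""
    g = [0.0] * len(x)
    for w in W:
        d1 = -1.0 / (1.0 + math.exp(_dot(w, x)))
        for r in range(len(x)):
            g[r] += d1 * w[r] / len(W)
    return g
```

In this sketch the approximate gradient never loops over the $n$ examples: only the batch update does, and its cost scales with the batch size rather than with $n$. When every Taylor point equals the query point, `gradient` agrees with `exact_gradient` up to rounding.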

Figures (2)

Figure 1: Performance of TUFW with Rule-SBD$\sqrt{k}$ and Rule-DBD$\sqrt{k}$ and the standard Frank-Wolfe method, with and without adaptive step-sizes, on the logistic regression problem, on the dataset a9a.
Figure 2: Performance of TUFW with Rule-SBD$\sqrt[4]{K}$ and Rule-DBD$\sqrt[4]{K}$ and the standard Frank-Wolfe method, with and without adaptive step-sizes, on the non-convex binary classification problem, on the dataset a9a.

Theorems & Definitions (55)

Proposition 2.1
Proof 1: Proof of Proposition 2.1
Definition 3.1
Proposition 3.2
Proof 2: Proof of \ref{['karljunior']}
Theorem 3.3
Corollary 3.4
Definition 3.5
Proposition 3.6
Proof 3: Proof of \ref{['karlsenior']}
...and 45 more
