Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization
Zikai Xiong, Robert M. Freund
TL;DR
This work addresses the challenge of large-scale ERM where the cost of exact gradient evaluation scales with the number of observations $n$. It introduces TUFW, a Taylor-point Updating Frank-Wolfe method that replaces full gradients with second-order Taylor-approximated gradients around per-example Taylor points, coupled with stochastic or deterministic batch-update rules to control gradient computations. The authors prove convergence guarantees for both convex and non-convex losses, derive specialized rates for ERM with linear prediction, and develop an adaptive step-size variant with competitive performance. Computational experiments on real datasets demonstrate substantial speed-ups over existing Frank-Wolfe methods, highlighting the practical impact of reducing $n$-dependence while preserving or approaching optimal convergence rates.
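To make the idea of a Taylor-approximated gradient concrete, here is a sketch for ERM with linear prediction. The notation is assumed for illustration and is not taken from the paper: $F(x) = \frac{1}{n}\sum_{i=1}^n \ell_i(w_i^\top x)$ with feature vectors $w_i$, scalar losses $\ell_i$, and a per-example Taylor point $b_i$. Expanding each $\ell_i'$ to first order around $w_i^\top b_i$ gives:

```latex
\nabla F(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell_i'\!\left(w_i^\top x\right) w_i
\;\approx\; \frac{1}{n}\sum_{i=1}^{n}
\Big[\, \ell_i'\!\left(w_i^\top b_i\right)
      + \ell_i''\!\left(w_i^\top b_i\right)\, w_i^\top (x - b_i) \,\Big]\, w_i .
```

Under this reading, the approximation is exact for least squares loss (where $\ell_i''$ is constant), and for a loss such as logistic loss its error is controlled by how far $x$ has moved from each $b_i$. Refreshing only a batch of Taylor points per iteration is what would keep the number of fresh per-example derivative evaluations below $n$.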
Abstract
The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization, one of the fundamental optimization problems in statistical and machine learning, the computational cost of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. To reduce this dependence on $n$, we exploit the second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods simultaneously reduce the dependence on large $n$ and attain the optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Finally, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
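The following is a minimal Python sketch of a Frank-Wolfe loop driven by Taylor-approximated gradients, written for logistic loss over an $\ell_1$ ball. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`taylor_fw`, `lmo_l1`, `logistic_derivs`), the uniform random batch rule, the $2/(k+2)$ step size, and the cached aggregates `c` and `H` are all choices made here for the example.

```python
import numpy as np

def logistic_derivs(z, y):
    """First and second derivatives in z of log(1 + exp(-y*z)), for y in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * z))
    return -y * s, s * (1.0 - s)

def lmo_l1(g, radius):
    """Linear minimization oracle over the l1 ball: a signed, scaled coordinate vector."""
    j = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[j] = -radius * np.sign(g[j])
    return s

def taylor_fw(W, y, radius=1.0, iters=200, batch=10, seed=0):
    """Frank-Wolfe with a Taylor-approximated gradient model g(x) = c + H @ x.

    Assumed bookkeeping (hypothetical, for illustration only):
      z[i]  = w_i^T b_i, the margin at example i's Taylor point b_i
      d1, d2 = loss derivatives evaluated at z
      c = (1/n) sum_i (d1_i - d2_i * z_i) w_i
      H = (1/n) sum_i d2_i * w_i w_i^T
    """
    rng = np.random.default_rng(seed)
    n, p = W.shape
    x = np.zeros(p)
    z = W @ x                                  # all Taylor points start at x0
    d1, d2 = logistic_derivs(z, y)
    c = W.T @ (d1 - d2 * z) / n
    H = (W.T * d2) @ W / n
    for k in range(iters):
        g = c + H @ x                          # approximate gradient, no pass over the data
        s = lmo_l1(g, radius)
        x = x + 2.0 / (k + 2.0) * (s - x)      # standard open-loop step size
        # Move a random batch of Taylor points to the new iterate.
        idx = rng.choice(n, size=min(batch, n), replace=False)
        Wb = W[idx]
        c -= Wb.T @ (d1[idx] - d2[idx] * z[idx]) / n   # remove stale contributions
        H -= (Wb.T * d2[idx]) @ Wb / n
        z[idx] = Wb @ x
        d1[idx], d2[idx] = logistic_derivs(z[idx], y[idx])
        c += Wb.T @ (d1[idx] - d2[idx] * z[idx]) / n   # add refreshed contributions
        H += (Wb.T * d2[idx]) @ Wb / n
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(200, 5))
    y = np.sign(W[:, 0])                       # synthetic labels, for demonstration only
    print(taylor_fw(W, y, radius=1.0, iters=200, batch=20))
```

In this sketch each iteration touches only `batch` examples plus a $p \times p$ matrix-vector product, rather than all $n$ examples. Setting `batch` equal to $n$ refreshes every Taylor point at every step, in which case the model gradient coincides with the exact gradient and the loop reduces to ordinary Frank-Wolfe. Storing the dense matrix `H` is only sensible when the dimension $p$ is moderate.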
