
Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features

Aleksandr Beznosikov, David Dobre, Gauthier Gidel

TL;DR

This work introduces two stochastic Frank–Wolfe variants for constrained finite-sum minimization: Sarah Frank–Wolfe (fw_sarah) and Saga Sarah Frank–Wolfe (fw_zerosarah). They achieve state-of-the-art convergence guarantees for both convex and non-convex objectives while avoiding large-batch strategies and full-gradient computations when possible, leveraging variance-reduction techniques (SARAH, SAGA) within a projection-free, LMO-enabled framework. The analysis uses a Lyapunov quantity that tracks gradient-estimator accuracy, yielding explicit rates and optimal parameter choices (e.g., $p$ and $b$) and demonstrating favorable LMO and stochastic oracle complexities. Empirical results on LibSVM datasets validate the theory, showing competitive or superior performance with respect to existing projection-free baselines, and the work outlines directions for extending the approach to strongly convex settings and distributed settings with compression.
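To make the "projection-free, LMO-enabled" idea concrete, here is a minimal sketch of a linear minimization oracle (LMO) and a single Frank-Wolfe update. It assumes the constraint set is an $\ell_1$ ball, which is an illustrative choice and not taken from the text above; the names `lmo_l1_ball` and `fw_step` are hypothetical.

```python
import numpy as np

def lmo_l1_ball(grad, radius):
    """Return a minimizer of <grad, s> over the l1 ball {s : ||s||_1 <= radius}.

    The minimizer is a vertex of the ball: a single signed, scaled basis vector.
    """
    i = int(np.argmax(np.abs(grad)))           # coordinate with the largest |gradient|
    s = np.zeros_like(grad, dtype=float)
    s[i] = -radius * np.sign(grad[i])          # move against the gradient sign
    return s

def fw_step(x, grad, radius, eta):
    """One Frank-Wolfe update: a convex combination of x and the LMO vertex.

    For eta in [0, 1] and feasible x, the result stays feasible, so no
    projection is needed.
    """
    s = lmo_l1_ball(grad, radius)
    return x + eta * (s - x)

# Illustrative values (assumptions, not from the paper).
g = np.array([0.5, -2.0, 1.0])
x = np.zeros(3)
x_next = fw_step(x, g, radius=1.0, eta=0.5)
```

The LMO costs one pass over the gradient, whereas a Euclidean projection onto the same set generally requires sorting or a root-finding step; this is the usual motivation for projection-free methods on structured sets.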

Abstract

The Frank-Wolfe (FW) method is a popular approach for solving optimization problems with structured constraints that arise in machine learning applications. In recent years, stochastic versions of FW have gained popularity, motivated by large datasets for which the computation of the full gradient is prohibitively expensive. In this paper, we present two new variants of the FW algorithms for stochastic finite-sum minimization. Our algorithms have the best convergence guarantees of existing stochastic FW approaches for both convex and non-convex objective functions. Our methods do not have the issue of permanently collecting large batches, which is common to many stochastic projection-free approaches. Moreover, our second approach does not require either large batches or full deterministic gradients, which is a typical weakness of many techniques for finite-sum problems. The faster theoretical rates of our approaches are confirmed experimentally.
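The combination of Frank-Wolfe with SARAH-style variance reduction can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact algorithm: the objective is a least-squares finite sum, the constraint set is an $\ell_1$ ball, the step size is the classical $2/(k+2)$ schedule, and the estimator is refreshed with a full gradient with probability `p` and otherwise corrected with a small batch of size `batch_size`. The function names and all parameter values are hypothetical; the paper's own choices of $p$, $b$ and $\eta_k$ come from its theorems and are not reproduced here.

```python
import numpy as np

def lmo_l1_ball(g, radius):
    """Vertex of the l1 ball minimizing <g, s>."""
    i = int(np.argmax(np.abs(g)))
    s = np.zeros_like(g)
    s[i] = -radius * np.sign(g[i])
    return s

def sarah_fw_sketch(A, y, radius, num_iters, batch_size, p, seed=0):
    """Frank-Wolfe with a loopless SARAH-style gradient estimator.

    Minimizes f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2 over the l1 ball.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    def full_grad(x):
        return A.T @ (A @ x - y) / n

    x = np.zeros(d)
    g = full_grad(x)                       # start from an exact gradient
    for k in range(num_iters):
        eta = 2.0 / (k + 2)                # assumption: classical FW step size
        s = lmo_l1_ball(g, radius)         # LMO call using the estimator g
        x_new = x + eta * (s - x)          # projection-free update
        if rng.random() < p:
            g = full_grad(x_new)           # occasional full-gradient refresh
        else:
            idx = rng.integers(0, n, size=batch_size)
            Ab = A[idx]
            # For least squares, grad_i(x_new) - grad_i(x) = a_i a_i^T (x_new - x).
            g = g + Ab.T @ (Ab @ (x_new - x)) / batch_size
        x = x_new
    return x

# Synthetic data for illustration only.
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=(200, 10))
x_true = np.zeros(10)
x_true[0] = 1.0
y = A @ x_true
x_hat = sarah_fw_sketch(A, y, radius=1.0, num_iters=300, batch_size=8, p=0.1)
```

Most iterations touch only `batch_size` samples, so the per-iteration cost stays small while the recursive correction keeps the estimator close to the true gradient. Every iterate is a convex combination of feasible points, so `x_hat` remains inside the ball without any projection.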

Paper Structure (19 sections, 17 theorems, 96 equations, 6 figures, 2 tables)

Key Result

Theorem 4.1

Let $\{x^k\}_{k\geq0}$ denote the iterates of Algorithm \ref{alg:fw_sarah} for solving problem \ref{eq:main_problem}, which satisfies Assumptions \ref{as:lip}--\ref{as:set}. Let $x^*$ be the minimizer of $f$. Then for any $K$ one can choose the step sizes $\{ \eta_k \}_{k \geq 0}$ so that a convergence guarantee holds; the explicit step-size formula and the resulting bound are stated in the paper and are not reproduced in this summary.

Figures (6)

  • Figure 1: Comparison of state-of-the-art projection-free methods with small batches for \ref{eq;ls}. The comparison uses real datasets from LibSVM. The criterion is the number of full-gradient computations. In the modified plots (the right plots in the first three rows), only every 100th point is kept for negiar2020stochastic, weber2022projection, Algorithm \ref{alg:fw_sarah} and Algorithm \ref{alg:fw_zerosarah}.
  • Figure 2: Comparison of state-of-the-art projection-free methods with small batches for \ref{eq;nls}. The comparison uses real datasets from LibSVM. The criterion is the number of full-gradient computations. In the modified plots (the right plots in the first three rows), only every 100th point is kept for negiar2020stochastic, weber2022projection, Algorithm \ref{alg:fw_sarah} and Algorithm \ref{alg:fw_zerosarah}.
  • Figure 2: Datasets from LibSVM in experiments.
  • Figure 3: Comparison of state-of-the-art projection-free methods with small batches for \ref{eq;ls} with $R = 200$. The comparison uses real datasets from LibSVM. The criterion is the number of full-gradient computations.
  • Figure 4: Comparison of state-of-the-art projection-free methods with small batches for \ref{eq;ls} with $R = 20$. The comparison uses real datasets from LibSVM. The criterion is the number of full-gradient computations.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Corollary 4.4
  • Theorem 4.5
  • Corollary 4.6
  • Theorem 4.7
  • Corollary 4.8
  • Lemma 1.1
  • Lemma 1.2: Lemma 1.2.3 from nesterov2003introductory
  • ...and 8 more