Blended Conditional Gradients: the unconditioning of conditional gradients

Gábor Braun, Sebastian Pokutta, Dan Tu, Stephen Wright

TL;DR

The paper introduces Blended Conditional Gradients (BCG), a projection-free optimization method for minimizing smooth convex functions over polytopes by blending Frank–Wolfe steps with simplex-based gradient descent. It leverages a weak-separation oracle and a simplex descent oracle to navigate the active vertex set efficiently, achieving linear convergence for strongly convex objectives via a simplicial-curvature–geometric-strong-convexity framework. Theoretical results show $f(x_T) - f(x^*) \to 0$ at a rate of $O\bigl(\frac{C^{\Delta}}{\mu} \log \frac{\Phi_0}{\varepsilon}\bigr)$, and practical experiments across Lasso, video co-localization, structured regression, matrix completion, and sparse recovery demonstrate substantial speedups and sparser solutions compared to standard CG variants. The work emphasizes projection-free operation, sparse representations, and lazy oracle evaluation, with additional simplex-specific variants and enhancements improving real-world performance and suggesting further extensions to broader growth conditions and acceleration strategies.

Abstract

We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank--Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. Our approach retains all favorable properties of conditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates as sparse convex combinations of a limited number of extreme points of P. The algorithm is lazy, making use of inexpensive inexact solutions of the linear programming subproblem that characterizes the conditional gradient approach. It decreases measures of optimality (primal and dual gaps) rapidly, both in the number of iterations and in wall-clock time, outperforming even the lazy conditional gradient algorithms of [arXiv:1410.8816]. We also present a streamlined version of the algorithm for the probability simplex.
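To make the "projection-free" and "sparse convex combination" properties concrete, here is a minimal sketch of the plain Frank–Wolfe (conditional gradient) method on the probability simplex. It is not the BCG algorithm from the paper: there are no blended gradient steps and no lazification. The least-squares objective, the function name `frank_wolfe_simplex`, and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(A, b, num_iters=200, tol=1e-10):
    """Vanilla Frank-Wolfe for min 0.5*||Ax - b||^2 over the probability simplex.

    Illustrative sketch only (hypothetical objective and names); this is
    NOT the blended algorithm described in the paper.
    """
    n = A.shape[1]
    x = np.zeros(n)
    x[0] = 1.0                      # start at a vertex of the simplex
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        i = int(np.argmin(grad))    # LP oracle: the best vertex is e_i
        d = -x.copy()
        d[i] += 1.0                 # Frank-Wolfe direction e_i - x
        gap = -(grad @ d)           # Frank-Wolfe (dual) gap
        if gap <= tol:
            break
        Ad = A @ d
        denom = Ad @ Ad
        # Exact line search for a quadratic, clipped to [0, 1] so the
        # iterate stays a convex combination of vertices (no projection).
        gamma = 1.0 if denom == 0.0 else min(1.0, gap / denom)
        x = x + gamma * d
    return x
```

Each iteration adds at most one vertex to the support of `x`, so after `t` iterations the iterate is a convex combination of at most `t + 1` vertices.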

Paper Structure

This paper contains 19 sections, 5 theorems, 48 equations, 12 figures, 1 table, 3 algorithms.

Key Result

Theorem 3.1

Let $f$ be a strongly convex, smooth function over the polytope $P$ with simplicial curvature $C^{\Delta}$ and geometric strong convexity $\mu$. Then the BCG algorithm ensures $f(x_{T}) - f(x^{*}) \leq \varepsilon$, where $x^{*}$ is an optimal solution to $f$ in $P$, for some iteration index $T$ of order $O\bigl(\frac{C^{\Delta}}{\mu} \log \frac{\Phi_0}{\varepsilon}\bigr)$, where $\log$ denotes logarithms to the base $2$.

Figures (12)

  • Figure 1: Four representative examples. (Upper-left) Sparse signal recovery: $\min_{x \in \mathbb{R}^n : \|x\|_1 \leq \tau} \|y - \Phi x\|_2^2$, where $\Phi$ is of size $1000 \times 3000$ with density $0.05$. BCG made $1402$ iterations with $155$ calls to the weak-separation oracle $\operatorname{LPsep}_P$. The final solution is a convex combination of $152$ vertices. (Upper-right) Lasso. We solve $\min_{x \in P} \|Ax - b\|^2$ with $P$ being the (scaled) $\ell_1$-ball. $A$ is a $400 \times 2000$ matrix with $100$ non-zeros. BCG made $2130$ iterations, calling $\operatorname{LPsep}_P$ $477$ times, with the final solution being a convex combination of $462$ vertices. (Lower-left) Structured regression over the Birkhoff polytope of dimension $50$. BCG made $2057$ iterations with $524$ calls to $\operatorname{LPsep}_P$. The final solution is a convex combination of $524$ vertices. (Lower-right) Video co-localization over netgen_12b polytope with an underlying $5000$-vertex graph. BCG made $140$ iterations, with $36$ calls to $\operatorname{LPsep}_P$. The final solution is a convex combination of $35$ vertices.
  • Figure 2: Comparison of BCG, ACG, PCG and CG on Lasso instances. Upper-left: $A$ is a $400 \times 2000$ matrix with $100$ non-zeros. BCG made $2130$ iterations, calling the LP oracle $477$ times, with the final solution being a convex combination of $462$ vertices giving the sparsity. Upper-right: $A$ is a $200 \times 200$ matrix with $100$ non-zeros. BCG made $13952$ iterations, calling the LP oracle $258$ times, with the final solution being a convex combination of $197$ vertices giving the sparsity. Lower-left: $A$ is a $500 \times 3000$ matrix with $100$ non-zeros. BCG made $3314$ iterations, calling the LP oracle $609$ times, with the final solution being a convex combination of $605$ vertices giving the sparsity. Lower-right: $A$ is a $1000 \times 1000$ matrix with $200$ non-zeros. BCG made $2328$ iterations, calling the LP oracle $1007$ times, with the final solution being a convex combination of $526$ vertices giving the sparsity.
  • Figure 3: Comparison of PCG, Lazy PCG, and BCG on video co-localization instances. Upper-Left: netgen_12b for a $3000$-vertex graph. BCG made $202$ iterations, called $\operatorname{LPsep}_P$ $56$ times, and the final solution is a convex combination of $56$ vertices. Upper-Right: netgen_12b over a $5000$-vertex graph. BCG made $212$ iterations, $\operatorname{LPsep}_P$ was called $58$ times, and the final solution is a convex combination of $57$ vertices. Lower-Left: road_paths_01_DC_a over a $2000$-vertex graph. Even on instances where lazy PCG gains little advantage over PCG, BCG performs significantly better, with an empirically higher rate of convergence. BCG made $43$ iterations, $\operatorname{LPsep}_P$ was called $25$ times, and the final convex combination has $25$ vertices. Lower-Right: netgen_08a over an $800$-vertex graph. BCG made $2794$ iterations, $\operatorname{LPsep}_P$ was called $222$ times, and the final convex combination has $106$ vertices.
  • Figure 4: Comparison of BCG, LPCG and PCG on structured regression instances. Upper-Left: Over the disctom polytope. BCG made $3526$ iterations with $1410$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $85$ vertices. Upper-Right: Over a maxcut polytope over a graph with $28$ vertices. BCG made $76$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $13$ vertices. Lower-Left: Over the m100n500k4r1 polytope. BCG made $2137$ iterations with $944$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $442$ vertices. Lower-Right: Over the spanning tree polytope over the complete graph with $10$ nodes. BCG made $1983$ iterations with $262$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $247$ vertices. BCG outperforms LPCG and PCG, even in the cases where LPCG is much faster than PCG.
  • Figure 5: Comparison of BCG, ACG, PCG and CG over the Birkhoff polytope. Upper-Left: Dimension $50$. BCG made $2057$ iterations with $524$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $524$ vertices. Upper-Right: Dimension $100$. BCG made $151$ iterations with $134$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $134$ vertices. Lower-Left: Dimension $50$. BCG made $1040$ iterations with $377$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $377$ vertices. Lower-Right: Dimension $80$. BCG made $429$ iterations with $239$ $\operatorname{LPsep}_P$ calls and the final solution is a convex combination of $239$ vertices. BCG outperforms ACG, PCG and CG in all cases.
  • ...and 7 more figures
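
The captions above count calls to the weak-separation oracle $\operatorname{LPsep}_P$ separately from iterations. The sketch below illustrates why the two can differ, using the $\ell_1$-ball from the Lasso and sparse-recovery instances. It assumes the usual lazy-oracle contract: given a linear objective $c$, a point $x$, a target $\Phi$ and an accuracy $K \geq 1$, return a vertex $y$ with $c^\top (x - y) \geq \Phi / K$, or report that $c^\top (x - z) \leq \Phi$ for all $z \in P$. The function names, the list-based vertex cache, and the fallback to an exact LP oracle are illustrative choices, not the authors' implementation.

```python
import numpy as np

def lmo_l1_ball(c, tau):
    """Exact LP oracle over the l1-ball of radius tau.

    argmin_{||v||_1 <= tau} <c, v> is attained at a signed, scaled unit
    vector along the largest-magnitude coordinate of c (the zero vector
    if c == 0, since np.sign(0) == 0).
    """
    i = int(np.argmax(np.abs(c)))
    v = np.zeros_like(c, dtype=float)
    v[i] = -tau * np.sign(c[i])
    return v

def weak_separation(c, x, phi, K, cache, tau):
    """Hypothetical lazy oracle: try cached vertices before solving an LP.

    Returns a vertex y with <c, x - y> >= phi / K, or None, which here
    certifies <c, x - z> <= phi for every z in the l1-ball (K >= 1).
    """
    for y in cache:                      # cheap: reuse earlier vertices
        if c @ (x - y) >= phi / K:
            return y
    y = lmo_l1_ball(c, tau)              # expensive in general: exact LP
    if c @ (x - y) >= phi / K:
        cache.append(y)
        return y
    return None                          # even the best vertex falls short
```

In this sketch a negative answer is sound because the exact oracle maximizes $c^\top (x - z)$ over the ball, so failing the $\Phi / K$ test implies the bound $\Phi$ for every feasible point.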

Theorems & Definitions (10)

  • Theorem 3.1
  • proof
  • Lemma 4.1
  • proof
  • Corollary 4.2
  • proof
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof