Revisiting Frank-Wolfe for Structured Nonconvex Optimization

Hoomaan Maskan, Yikun Hou, Suvrit Sra, Alp Yurtsever

TL;DR

The paper addresses nonconvex optimization expressed as a difference of convex functions under a projection-free setting. It introduces Dc-Fw, a framework that couples the DC Algorithm with Frank-Wolfe, and analyzes two natural DC decompositions that yield distinct variants, including a gradient-efficient CGS-style method and an inexact proximal-point method. Theoretical results show first-order stationarity in 𝒪(1/ε^2) FW steps and, for suitably structured domains, improved gradient and LMO complexities, with empirical validation on quadratic assignment and partially observed embedding alignment. The work demonstrates how problem reformulation via DC decompositions can meaningfully enhance projection-free optimization, and points to stochastic and adaptive decomposition extensions as promising directions.

Abstract

We introduce a new projection-free (Frank-Wolfe) method for optimizing structured nonconvex functions that are expressed as a difference of two convex functions. This problem class subsumes smooth nonconvex minimization, positioning our method as a promising alternative to the classical Frank-Wolfe algorithm. DC decompositions are not unique; by carefully selecting a decomposition, we can better exploit the problem structure, improve computational efficiency, and adapt to the underlying problem geometry to find better local solutions. We prove that the proposed method achieves a first-order stationary point in $O(1/ε^2)$ iterations, matching the complexity of the standard Frank-Wolfe algorithm for smooth nonconvex minimization in general. Specific decompositions can, for instance, yield a gradient-efficient variant that requires only $O(1/ε)$ calls to the gradient oracle. Finally, we present numerical experiments demonstrating the effectiveness of the proposed method compared to other projection-free algorithms.
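The outer/inner structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm as specified in the paper: the function names (`dc_fw_sketch`, `lmo_simplex`), the short-step rule for the inner Frank-Wolfe loop, the iteration budgets, and the toy indefinite quadratic on the probability simplex are all hypothetical choices made for this example. The objective is written as $\phi = f - g$ with $f$ and $g$ convex; each outer step linearizes $g$ at the current point and runs Frank-Wolfe on the resulting convex surrogate.

```python
import numpy as np


def lmo_simplex(direction):
    """LMO over the probability simplex: vertex minimizing <direction, s>."""
    s = np.zeros_like(direction)
    s[np.argmin(direction)] = 1.0
    return s


def dc_fw_sketch(f, grad_f, g, grad_g, lmo, x0, lip_f,
                 outer_iters=30, inner_iters=200, tol=1e-10):
    """Illustrative DCA outer loop with a Frank-Wolfe inner solver.

    Minimizes phi(x) = f(x) - g(x) over a compact convex set that is
    accessed only through `lmo`. `lip_f` is a Lipschitz constant of grad_f.
    """
    x = x0.copy()
    history = [f(x) - g(x)]
    for _ in range(outer_iters):
        u = grad_g(x)  # linearize g at the current outer iterate
        y = x.copy()   # warm start for the surrogate min_y f(y) - <u, y>
        for _ in range(inner_iters):
            d = grad_f(y) - u   # gradient of the convex surrogate at y
            s = lmo(d)
            gap = d @ (y - s)   # Frank-Wolfe gap of the surrogate at y
            if gap <= tol:
                break
            # Short step (an assumption of this sketch): the surrogate
            # value never increases along the inner loop.
            step = min(1.0, gap / (lip_f * np.dot(s - y, s - y)))
            y = y + step * (s - y)
        x = y
        history.append(f(x) - g(x))
    return x, history


# Hypothetical toy instance: phi(x) = 0.5 x^T A x with A indefinite,
# split as A = P - N where P and N are positive semidefinite.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
P = (Q * np.array([2.0, 1.0, 0.5, 0.0, 0.0, 0.0])) @ Q.T
N = (Q * np.array([0.0, 0.0, 0.0, 0.5, 1.0, 2.0])) @ Q.T


def f(x):
    return 0.5 * x @ P @ x


def g(x):
    return 0.5 * x @ N @ x


def grad_f(x):
    return P @ x


def grad_g(x):
    return N @ x


x0 = np.full(6, 1.0 / 6.0)
x_out, history = dc_fw_sketch(f, grad_f, g, grad_g, lmo_simplex, x0, lip_f=2.0)

# Standard Frank-Wolfe gap of phi at the output (g is differentiable here).
grad_phi = grad_f(x_out) - grad_g(x_out)
fw_gap = grad_phi @ (x_out - lmo_simplex(grad_phi))
```

In this sketch the gradient of $g$ is evaluated once per outer iteration while the linear minimization oracle is called in every inner step, which hints at how a decomposition can trade gradient calls against LMO calls. Because the inner loop never increases the surrogate and $g$ is convex, the values stored in `history` are nonincreasing.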

Paper Structure

This paper contains 38 sections, 10 theorems, 61 equations, 6 figures, 1 algorithm.

Key Result

Lemma 1

The measure $\mathrm{gap}_{\textsc{dc}}(x_t)$ is nonnegative for any $x_t \in \mathcal{D}$, and it is equal to zero if and only if $x_t$ is a critical point satisfying the condition eqn:critical-point, where $\mathcal{N}_{\mathcal{D}}(x_t)$ is the normal cone of $\mathcal{D}$ at $x_t$. Moreover, if $g$ is differentiable (but not necessarily smooth, i.e., its gradients may not be Lipschitz continuous), then the condition eqn:critical-point re

Figures (6)

  • Figure 1: Comparison of the DC gap function for different decompositions of $\phi(x_1,x_2) = \sin(\pi x_1)\cos(\pi x_2)$ on the domain $[-1,1]^2$. [Left] Level curves of $\phi$. [Middle] $\mathrm{gap}^L_{\textsc{pgm}}$, corresponding to the decomposition in \ref{subsec:cgs}, which linearizes $\phi$ and therefore does not distinguish between local minima, saddle points, and local maxima. [Right] $\mathrm{gap}^L_{\textsc{ppm}}$, corresponding to the decomposition in \ref{subsec:ppfw}, which retains curvature information in $\phi$; it is flatter around local minima and sharper around saddle points and local maxima.
  • Figure 2: Assignment error of FW and Dc-Fw for solving QAP using a relax-and-round strategy. An error of zero indicates an exact solution. The instances are ordered from best to worst performance of FW. In total, 134 datasets from QAPLIB were used: Dc-Fw outperformed FW in 73 cases, FW performed better in 43 cases, and both methods achieved the same assignment error in 18 cases.
  • Figure 3: FW gap evolution as a function of the iteration counter (left), the number of SVD computations (middle), and the wall clock time (right) for the alignment of partially observed embeddings. In $10^4$ iterations, FW-K and FW-M called the subgradient and linear minimization oracles $10^4$ times each; FW-M performed an additional $19{,}990$ function evaluations during the backtracking line-search; Dc-Fw called the linear minimization oracle $10^4$ times and the subgradient oracle $88$ times.
  • Figure 4: Comparison between FW, Dc-Fw variants, and CGS with 90% confidence intervals.
  • Figure 5: Comparison of FW and Dc-Fw for training classification models with the cross-entropy (CE) loss, using transfer learning on EfficientNetB0 and a customized CNN. The training datasets were CIFAR-10 and CIFAR-100. In all panels, dashed and solid lines refer to the validation and the training data, respectively.
  • ...and 1 more figure

Theorems & Definitions (23)

  • Definition 1
  • Lemma 1
  • Theorem 2
  • Lemma 3: Theorem 1 in jaggi2013revisiting
  • Corollary 4
  • Definition 2
  • Lemma 5: Theorem 2 in garber2015faster
  • Corollary 6
  • Remark 7: Comparison with khamaru2019convergence, millan2023frank
  • Remark 8
  • ...and 13 more