Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

Xingyu Chen, Bokun Wang, Ming Yang, Qihang Lin, Tianbao Yang

TL;DR

This work tackles non-smooth, non-convex Finite-sum Coupled Compositional Optimization (FCCO) by introducing stochastic momentum methods that use outer (and nested) Moreau envelope smoothing to produce tractable surrogates. The authors propose two algorithms, SONEX for smooth inner functions and ALEXR2 for smooth or weakly convex inner functions, achieving a new state-of-the-art iteration complexity of $O(1/\epsilon^5)$. They further apply the smoothing techniques to non-convex inequality-constrained problems via smoothed hinge penalties, finding a (nearly) $\epsilon$-KKT solution at the same $O(1/\epsilon^5)$ complexity. Empirical results on group DRO, AUC maximization with ROC fairness constraints, and continual learning with non-forgetting constraints show that the proposed methods outperform existing baselines in both optimization efficiency and constraint satisfaction.

Abstract

Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art results face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are twofold: (i) we propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) we establish a new state-of-the-art iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to non-convex optimization problems with multiple inequality constraints, where the constraint functions are smooth or weakly convex. By optimizing a formulation based on a smoothed hinge penalty, we achieve a new state-of-the-art complexity of $O(1/\epsilon^5)$ for finding a (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.

Paper Structure

This paper contains 34 sections, 27 theorems, 118 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Theorem 4.2

If $\mathbf{w}$ is an $\epsilon$-stationary solution to $F_{\lambda}(\cdot)$ with $\lambda=\epsilon/C_f$ such that $\|\nabla F_{\lambda}(\mathbf{w})\|\leq \epsilon$, then $\mathbf{w}$ is an approximate $\epsilon$-stationary solution to the original objective $F(\cdot)$.
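To make the smoothing idea behind this result concrete, the following is a minimal sketch of the Moreau envelope of the scalar hinge function $h(x)=\max(0,x)$, which is the kind of smoothed hinge the abstract refers to. It is an illustration under stated assumptions, not the paper's SONEX or ALEXR2 algorithm: the function names (`hinge`, `prox_hinge`, `moreau_hinge`, `grad_moreau_hinge`) and the test values are hypothetical, and this summary does not define the constant $C_f$, so the sketch simply takes a smoothing parameter `lam` as input. It uses the standard definition $h_\lambda(x)=\min_y \{h(y)+\frac{1}{2\lambda}(x-y)^2\}$.

```python
import numpy as np

def hinge(x):
    """Non-smooth hinge h(x) = max(0, x)."""
    return np.maximum(0.0, x)

def prox_hinge(x, lam):
    """Proximal map of the hinge: argmin_y max(0, y) + (y - x)^2 / (2 * lam).

    Closed form: y = x for x <= 0, y = 0 for 0 < x <= lam, y = x - lam otherwise.
    """
    return np.where(x <= 0.0, x, np.maximum(x - lam, 0.0))

def moreau_hinge(x, lam):
    """Moreau envelope h_lam(x), evaluated at the proximal point."""
    y = prox_hinge(x, lam)
    return hinge(y) + (x - y) ** 2 / (2.0 * lam)

def grad_moreau_hinge(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam, which lies in [0, 1]."""
    return (x - prox_hinge(x, lam)) / lam

# Illustrative values only (not from the paper).
lam = 0.5
x = np.array([-1.0, 0.25, 2.0])
values = moreau_hinge(x, lam)
grads = grad_moreau_hinge(x, lam)
gap = hinge(x) - values  # how far the smooth surrogate sits below the hinge
```

The envelope is differentiable everywhere with a $1/\lambda$-Lipschitz gradient, and for this 1-Lipschitz hinge it stays within $\lambda/2$ of the original function. A smaller $\lambda$ therefore gives a tighter but less smooth surrogate, which is the trade-off behind tying $\lambda$ to the target accuracy $\epsilon$.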

Figures (5)

  • Figure 1: Training loss curves (left three) and testing accuracy (right one) of different methods for Group DRO with CVaR ratio $r=0.15$ on different datasets.
  • Figure 4: Training curves of 5 constraint values in zero-one loss of different methods for continual learning with non-forgetting constraints when targeting the foggy class. Top: squared-hinge penalty method with different $\rho$; Bottom: smoothed hinge penalty method with different $\rho$.
  • Figure 5: Training curves of 4 constraint values in zero-one loss of different methods for continual learning with non-forgetting constraints when targeting the overcast class. Top: squared-hinge penalty method with different $\rho$; Bottom: smoothed hinge penalty method with different $\rho$.
  • Figure 6: Training curves of 14 constraint values of different methods on the Adult dataset for AUC maximization with ROC fairness constraints. Top row: SOX with squared-hinge penalty method; Middle: SONX with hinge penalty method; Bottom: ALEXR2 with smoothed hinge penalty method.
  • Figure 7: Training curves of 14 constraint values of different methods on the COMPAS dataset for AUC maximization with ROC fairness constraints. Top row: SOX with squared-hinge penalty method; Middle: SONX with hinge penalty method; Bottom: ALEXR2 with smoothed hinge penalty method.

Theorems & Definitions (45)

  • Definition 3.1
  • Definition 4.1
  • Theorem 4.2
  • Theorem 4.4
  • Theorem 4.7
  • Corollary 4.8
  • Theorem 4.10
  • Corollary 4.11
  • Proposition 5.2
  • Theorem 5.3
  • ...and 35 more