First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints

Zhankun Luo, Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi

TL;DR

This framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches.

Abstract

This paper addresses the distributed stochastic minimax optimization problem subject to stochastic constraints. We propose a novel first-order Softmax-Weighted Switching Gradient method tailored for federated learning. Under full client participation, our algorithm achieves the standard $\mathcal{O}(\epsilon^{-4})$ oracle complexity to satisfy a unified bound $\epsilon$ for both the optimality gap and feasibility tolerance. We extend our theoretical analysis to the practical partial participation regime by quantifying client sampling noise through a stochastic superiority assumption. Furthermore, by relaxing standard boundedness assumptions on the objective functions, we establish a strictly tighter lower bound for the softmax hyperparameter. We provide a unified error decomposition and establish a sharp $\mathcal{O}(\log\frac{1}{\delta})$ high-probability convergence guarantee. Ultimately, our framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches. We verify the efficacy of our algorithm via experiments on Neyman-Pearson (NP) classification and fair classification tasks.

Paper Structure (22 sections, 14 theorems, 244 equations, 4 figures, 2 tables, 3 algorithms)

Key Result

lemma 11

Let $\psi$ be a convex function, then for Bregman divergence $D_{\psi}[\mathbf{x}||\mathbf{x}'] := \psi(\mathbf{x}) - \psi(\mathbf{x}') - \langle \nabla \psi(\mathbf{x}'), \mathbf{x} - \mathbf{x}'\rangle \geq 0$ with any $\mathbf{x}, \mathbf{x}'$ in the domain of $\psi$, the following identity holds
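
The identity itself is not shown in this summary. As a hedged illustration, the sketch below numerically checks one standard arrangement of the three-point identity (equivalent to equation (4.1) of bubeck2015convex), namely $\langle \nabla\psi(\mathbf{z}) - \nabla\psi(\mathbf{y}), \mathbf{x} - \mathbf{y}\rangle = D_{\psi}[\mathbf{x}||\mathbf{y}] + D_{\psi}[\mathbf{y}||\mathbf{z}] - D_{\psi}[\mathbf{x}||\mathbf{z}]$. The paper's exact arrangement may differ. The choice of $\psi$ (negative entropy), the helper name `bregman`, and the random test points are assumptions made only for this example.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi[x || y] = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

# Illustrative convex function (an assumption, not taken from the paper):
# negative entropy on the positive orthant.
def psi(x):
    return float(np.sum(x * np.log(x)))

def grad_psi(x):
    return np.log(x) + 1.0

rng = np.random.default_rng(0)
# Three arbitrary points with strictly positive coordinates.
x, y, z = rng.uniform(0.1, 1.0, size=(3, 5))

# Left-hand side: inner product of the gradient difference with x - y.
lhs = np.dot(grad_psi(z) - grad_psi(y), x - y)
# Right-hand side: signed combination of three Bregman divergences.
rhs = (bregman(psi, grad_psi, x, y)
       + bregman(psi, grad_psi, y, z)
       - bregman(psi, grad_psi, x, z))

print(f"lhs = {lhs:.12f}, rhs = {rhs:.12f}")
```

The two sides agree up to floating-point error for any differentiable $\psi$; convexity is only needed for the nonnegativity $D_{\psi} \geq 0$.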

Figures (4)

  • Figure 1: NP classification. Objective $F(\mathbf{w}_k)$ and constraint $G(\mathbf{w}_k)$ vs. gradient evaluations. Comparisons against penalty and primal-dual baselines under full participation ($E=1, m=n$; top) and partial participation ($E=5, \frac{m}{n}=0.5$; bottom). Red dashed line: tolerance ($\epsilon$).
  • Figure 2: $\alpha$-sensitivity. Impact of temperature $\alpha$. High $\alpha$ approximates the hard $\max$ operator, while low $\alpha$ smooths the objective toward a simple average.
  • Figure 3: Fair classification. Comparisons against penalty and primal-dual baselines. Top: full participation ($E=1, m=n$). Bottom: partial participation ($E=2, \frac{m}{n}=0.5$).
  • Figure 4: Federated learning settings. Impact of the number of local epochs $E$ (top row) and the client participation ratio $m/n$ (bottom row).
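
The role of the temperature $\alpha$ described in the Figure 2 caption can be illustrated with a small sketch. It assumes client weights proportional to $\exp(\alpha f_i)$, where $f_i$ is the loss of client $i$. The paper's exact weighting rule is not shown in this summary, so the function name `softmax_weights` and the sample losses are hypothetical.

```python
import numpy as np

def softmax_weights(losses, alpha):
    """Weights proportional to exp(alpha * loss_i) (assumed form)."""
    z = alpha * np.asarray(losses, dtype=float)
    z = z - z.max()          # shift for numerical stability; weights unchanged
    w = np.exp(z)
    return w / w.sum()

# Hypothetical per-client losses, for illustration only.
losses = np.array([0.2, 0.5, 1.0])

for alpha in (0.0, 1.0, 100.0):
    w = softmax_weights(losses, alpha)
    # The weighted loss moves from the simple average (alpha = 0)
    # toward the worst-case client loss (large alpha).
    print(f"alpha={alpha:6.1f}  weights={np.round(w, 4)}  weighted_loss={w @ losses:.4f}")
```

With $\alpha = 0$ every client receives equal weight, so the weighted loss is the plain average. As $\alpha$ grows, nearly all weight concentrates on the client with the largest loss, which approximates the hard $\max$.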

Theorems & Definitions (42)

  • remark
  • remark
  • remark
  • definition 8: Stochastic Superiority via First-Order Stochastic Dominance (FSD), see shaked2007stochastic
  • remark
  • remark
  • remark
  • lemma 11: Three-point Bregman Divergence Identity, see equation (4.1) on page 297 of bubeck2015convex
  • lemma 12: Polarization Identity
  • lemma 13: Properties of Weighted Functions
  • ...and 32 more