First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints

Zhankun Luo; Antesh Upadhyay; Sang Bin Moon; Abolfazl Hashemi


TL;DR

This framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches.

Abstract

This paper addresses the distributed stochastic minimax optimization problem subject to stochastic constraints. We propose a novel first-order Softmax-Weighted Switching Gradient method tailored for federated learning. Under full client participation, our algorithm achieves the standard $\mathcal{O}(\epsilon^{-4})$ oracle complexity to bring both the optimality gap and the feasibility violation within a unified tolerance $\epsilon$. We extend our theoretical analysis to the practical partial participation regime by quantifying client sampling noise through a stochastic superiority assumption. Furthermore, by relaxing standard boundedness assumptions on the objective functions, we establish a strictly tighter lower bound for the softmax hyperparameter. We provide a unified error decomposition and establish a sharp $\mathcal{O}(\log\frac{1}{\delta})$ high-probability convergence guarantee. Ultimately, our framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches. We verify the efficacy of our algorithm through experiments on Neyman-Pearson (NP) classification and fair classification tasks.
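To make the "single-loop primal-only switching" idea concrete, the following is a minimal sketch on a toy problem, not the authors' algorithm: it is deterministic, runs on one machine, and uses a one-dimensional variable. The quadratic client losses, the constraint $g(w) = w - \text{bound} \le 0$, and all names (`softmax_weights`, `switching_gradient`, `centers`, `bound`) are assumptions introduced purely for illustration. At each step the iterate follows the constraint gradient if the violation exceeds the tolerance, and otherwise follows the softmax-weighted combination of client gradients.

```python
import math


def softmax_weights(values, alpha):
    """Weights proportional to exp(alpha * value), computed with a stability shift."""
    shift = max(alpha * v for v in values)
    exps = [math.exp(alpha * v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def switching_gradient(centers, bound, alpha=1.0, eps=0.05, step=0.01,
                       iters=2000, w=0.0):
    """Toy switching scheme (illustrative assumption, not the paper's method).

    Client i has the hypothetical loss f_i(w) = (w - centers[i])**2 and the
    constraint is g(w) = w - bound <= 0.
    """
    for _ in range(iters):
        violation = w - bound
        if violation > eps:
            # Constraint violated beyond the tolerance: step along grad g = 1.
            grad = 1.0
        else:
            # Constraint acceptable: step along the softmax-weighted client gradients,
            # which emphasizes the worst-performing clients.
            losses = [(w - c) ** 2 for c in centers]
            weights = softmax_weights(losses, alpha)
            grad = sum(p * 2.0 * (w - c) for p, c in zip(weights, centers))
        w -= step * grad
    return w


# Unconstrained, the worst-case loss is minimized at w = 2; the constraint
# w <= 1 pushes the solution to the boundary instead.
w_final = switching_gradient(centers=[0.0, 2.0, 4.0], bound=1.0)
print(w_final)
```

In this toy run the iterate settles in a small band around the constraint boundary $w = 1$, with the band width governed by the tolerance `eps` and the step size. No dual variable or penalty coefficient is tuned, which is the property the abstract contrasts with primal-dual and penalty-based approaches.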


Paper Structure (22 sections, 14 theorems, 244 equations, 4 figures, 2 tables, 3 algorithms)


Introduction
Non-smooth worst-case objective.
Coupling of constraints with minimax optimization.
Contributions
Problem Setup
Algorithm
Theoretical Analysis
Full participation and single local update.
Full participation and multiple local updates.
Partial participation and multiple local updates.
Experiments
Conclusion
Algorithms
Lemmas Used in Proofs
Proof for Result of Federated Learning with Full Participation
...and 7 more sections

Key Result

lemma 11

Let $\psi$ be a convex function, then for Bregman divergence $D_{\psi}[\mathbf{x}||\mathbf{x}'] := \psi(\mathbf{x}) - \psi(\mathbf{x}') - \langle \nabla \psi(\mathbf{x}'), \mathbf{x} - \mathbf{x}'\rangle \geq 0$ with any $\mathbf{x}, \mathbf{x}'$ in the domain of $\psi$, the following identity holds
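The identity itself does not appear in this extract. For orientation, the standard three-point identity for Bregman divergences, written in the notation above with an auxiliary point $\mathbf{x}''$ (a symbol assumed here, which may differ from the paper's), reads:

$$\langle \nabla \psi(\mathbf{x}'') - \nabla \psi(\mathbf{x}'), \mathbf{x} - \mathbf{x}' \rangle = D_{\psi}[\mathbf{x}||\mathbf{x}'] + D_{\psi}[\mathbf{x}'||\mathbf{x}''] - D_{\psi}[\mathbf{x}||\mathbf{x}'']$$

It follows by expanding the three divergences with the definition: the $\psi$ values cancel, and the remaining inner products combine into the left-hand side. This assumes $\psi$ is differentiable at $\mathbf{x}'$ and $\mathbf{x}''$, which the definition of $D_{\psi}$ already requires.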

Figures (4)

Figure 1: NP classification. Objective $F(\mathbf{w}_k)$ and constraint $G(\mathbf{w}_k)$ vs. gradient evaluations. Comparisons against penalty and primal-dual baselines under full participation ($E=1, m=n$; top) and partial participation ($E=5, \frac{m}{n}=0.5$; bottom). Red dashed line: tolerance ($\epsilon$).
Figure 2: $\alpha$-sensitivity. Impact of temperature $\alpha$. High $\alpha$ approximates the hard $\max$ operator, while low $\alpha$ smooths the objective toward a simple average.
Figure 3: Fair classification. Comparisons against penalty and primal-dual baselines. Top: full participation ($E=1, m=n$). Bottom: partial participation ($E=2, \frac{m}{n}=0.5$).
Figure 4: Federated learning settings. Impact of the number of local epochs $E$ (top row) and the client participation ratio $m/n$ (bottom row).
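The role of the temperature $\alpha$ described in the Figure 2 caption can be illustrated with a small sketch. It assumes the client weights are proportional to $\exp(\alpha f_i)$, which is one common softmax weighting and may differ in detail from the paper's definition; the loss values and the function names `softmax_weights` and `soft_worst_case` are hypothetical.

```python
import math


def softmax_weights(losses, alpha):
    """Weights proportional to exp(alpha * loss), computed with a stability shift."""
    shift = max(alpha * v for v in losses)
    exps = [math.exp(alpha * v - shift) for v in losses]
    total = sum(exps)
    return [e / total for e in exps]


def soft_worst_case(losses, alpha):
    """Softmax-weighted average of the client losses."""
    weights = softmax_weights(losses, alpha)
    return sum(p * v for p, v in zip(weights, losses))


client_losses = [0.2, 0.5, 1.0]  # hypothetical per-client losses
for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, soft_worst_case(client_losses, alpha))
```

With $\alpha = 0$ the weights are uniform and the value is the plain average of the client losses; as $\alpha$ grows, the weight concentrates on the client with the largest loss and the value approaches the hard maximum.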

Theorems & Definitions (42)

remark
remark
remark
definition 8: Stochastic Superiority via First-Order Stochastic Dominance (FSD), see shaked2007stochastic
remark
remark
remark
lemma 11: Three-point Bregman Divergence Identity, see equation (4.1) on page 297 of bubeck2015convex
lemma 12: Polarization Identity
lemma 13: Properties of Weighted Functions
...and 32 more
