Stability and Generalization for Distributed SGDA

Miaoxi Zhu, Yan Sun, Li Shen, Bo Du, Dacheng Tao

TL;DR

This work unifies two popular distributed minimax algorithms, Local-SGDA and Local-DSGDA, in a single stability-based framework. The theoretical results reveal the trade-off between the generalization gap and the optimization error, and suggest hyperparameter choices that attain the optimal population risk.

Abstract

Minimax optimization is gaining increasing attention in modern machine learning applications. Driven by large-scale models and massive volumes of data collected from edge devices, as well as the need to preserve client privacy, communication-efficient distributed minimax optimization algorithms have become popular, such as Local Stochastic Gradient Descent Ascent (Local-SGDA) and Local Decentralized SGDA (Local-DSGDA). Most existing research on distributed minimax algorithms focuses on convergence rates, computation complexity, and communication efficiency, while generalization performance remains underexplored, even though generalization ability is a pivotal indicator of how a model performs on unseen data. In this paper, we propose a stability-based generalization analysis framework for Distributed-SGDA, which unifies two popular distributed minimax algorithms, Local-SGDA and Local-DSGDA, and conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics under various settings, e.g., the (S)C-(S)C, PL-SC, and NC-NC cases. Our theoretical results reveal the trade-off between the generalization gap and the optimization error, and suggest hyperparameter choices that attain the optimal population risk. Numerical experiments for Local-SGDA and Local-DSGDA validate the theoretical results.
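
To make the Local-SGDA update pattern concrete, the following is a minimal sketch, not the paper's algorithm listing. It assumes scalar variables and hypothetical quadratic client objectives $f_i(x, y) = \tfrac{1}{2} a_i x^2 + b_i x y - \tfrac{1}{2} c_i y^2$ (strongly convex in $x$, strongly concave in $y$, saddle point at the origin). Each client runs several local descent steps on $x$ and ascent steps on $y$, and a server then averages the local models. The function name, parameters, and the optional Gaussian gradient noise are illustrative assumptions.

```python
import random


def local_sgda(clients, rounds=100, local_steps=5, lr=0.05, noise=0.0, seed=0):
    """Toy Local-SGDA on hypothetical quadratic saddle problems.

    clients: list of (a, b, c) tuples defining
             f_i(x, y) = 0.5*a*x**2 + b*x*y - 0.5*c*y**2  (assumed form).
    Returns the averaged (x, y) after the final communication round.
    """
    rng = random.Random(seed)
    x, y = 1.0, 1.0  # shared starting point
    for _ in range(rounds):
        local_models = []
        for a, b, c in clients:
            xi, yi = x, y  # each client starts from the current global model
            for _ in range(local_steps):
                # Gradients of f_i, optionally perturbed to mimic stochasticity.
                gx = a * xi + b * yi + noise * rng.gauss(0.0, 1.0)
                gy = b * xi - c * yi + noise * rng.gauss(0.0, 1.0)
                # Simultaneous update: descent on x, ascent on y.
                xi, yi = xi - lr * gx, yi + lr * gy
            local_models.append((xi, yi))
        # Communication round: average the local models.
        x = sum(m[0] for m in local_models) / len(local_models)
        y = sum(m[1] for m in local_models) / len(local_models)
    return x, y


if __name__ == "__main__":
    toy_clients = [(1.0, 0.5, 1.0), (2.0, 0.3, 1.5), (0.5, 1.0, 0.8)]
    print(local_sgda(toy_clients))
```

In this toy setting, `local_steps` and the number of clients play the roles of the local-update count and node count discussed in the abstract. The decentralized variant would replace the plain average with mixing over a communication topology, which this sketch does not model.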

Paper Structure

This paper contains 31 sections, 17 theorems, 109 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For a randomized $\epsilon$-stable distributed minimax algorithm $\mathcal{A}$, we have the following generalization gap for the model $(\mathcal{A}_{\bm{x}}(\mathcal{S}),\!\mathcal{A}_{\bm{y}}(\mathcal{S}))$ trained on the dataset $\mathcal{S}$,

Figures (3)

  • Figure 1: The first row shows the stability of the generator model trained with Local-SGDA; the second row shows the discriminator model. From left to right, the panels vary the learning rate, the number of nodes, the local dataset size, and the number of local steps. Each layer is assessed independently and shown as a dashed line.
  • Figure 3: The first row shows the stability of Local-SGDA on the AUC maximization task, evaluated by the Euclidean distance between the outputs of models trained on neighbouring datasets. The second row shows the generalization performance, evaluated by abs(training loss - test loss). From left to right, the panels vary the learning rate, the number of nodes, the local dataset size, and the number of local updates.
  • Figure : Distributed-SGDA ($\mathcal{A}(T,K,\bm{W})$)
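
The two quantities named in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, assuming each model is represented by a flat list of floats; the function names are hypothetical and not taken from the paper's code.

```python
import math


def parameter_distance(model_a, model_b):
    """Euclidean distance between two models given as flat lists of floats.

    Used here as an empirical stability proxy: model_a and model_b are
    assumed to be trained on neighbouring datasets.
    """
    if len(model_a) != len(model_b):
        raise ValueError("models must have the same number of parameters")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(model_a, model_b)))


def empirical_generalization_gap(train_loss, test_loss):
    """Absolute difference between training and test loss, as in the caption."""
    return abs(train_loss - test_loss)


if __name__ == "__main__":
    print(parameter_distance([0.0, 0.0], [3.0, 4.0]))
    print(empirical_generalization_gap(0.25, 0.75))
```

A smaller distance between the two trained models indicates that replacing a single sample changes the output less, which is the sense of stability the figures report.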

Theorems & Definitions (52)

  • Definition 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Definition 2: Weak Primal-Dual(PD) Generalization Gap
  • Definition 3: Excess Primal Generalization Gap
  • Definition 4: Distributed neighboring dataset
  • Definition 5: Distributed algorithmic stability
  • Remark 4
  • Definition 6: Convexity-Concavity
  • ...and 42 more