Adaptive Weighting Push-SUM for Decentralized Optimization with Statistical Diversity
Yiming Zhou, Yifei Cheng, Linli Xu, Enhong Chen
TL;DR
This work addresses how non-IID data (statistical diversity) hinders Push-SUM in directed decentralized optimization. It formulates Adaptive Weighting Push-SUM (AWPS), a generalization of Push-SUM with a generalized weight matrix, which, given sufficient communication, tightens the upper bound on the consensus distance (the distance between local and global parameters) from $O(1)$ to $O(1/N)$. Building on AWPS, the authors introduce SGAP and MSGAP (SGD and Momentum SGD variants) and prove a convergence rate with respect to statistical diversity of $O(N/T)$, compared with $O(Nd/T)$ on the original Push-SUM protocol, where $d$ is the parameter size of the model. To carry the theory into practice, they develop the Moreau weighting method, derived from the Moreau envelope, which approximately optimizes the envelope's distance penalty using buffer-based approximations and two tunable hyperparameters, $v$ and $k$. Deep-learning experiments (e.g., ResNet on CIFAR-10/100) complement the theory, showing improved accuracy and reduced wall-clock time on non-IID data and supporting the claim that AWPS with Moreau weighting outperforms Push-SUM in decentralized nonconvex optimization. Overall, the paper presents a principled protocol and a set of algorithms that improve consensus and convergence in directed decentralized networks under statistical diversity, with practical benefits for large-scale neural-network training.
Abstract
Statistical diversity is a property of the data distribution that can hinder optimization over a decentralized network. The theoretical limitations of the Push-SUM protocol, in turn, limit how well optimization algorithms built on it handle statistical diversity. In this paper, we theoretically and empirically mitigate the negative impact of statistical diversity on decentralized optimization using the Push-SUM protocol. Specifically, we propose the Adaptive Weighting Push-SUM protocol, a theoretical generalization of the original Push-SUM protocol that contains it as a special case. Our theoretical analysis shows that, with sufficient communication, the upper bound on the consensus distance for the new protocol reduces to $O(1/N)$, whereas it remains at $O(1)$ for the Push-SUM protocol. We apply SGD and Momentum SGD on the new protocol and prove that the convergence rate of these two algorithms with respect to statistical diversity is $O(N/T)$ on the new protocol, while it is $O(Nd/T)$ on the Push-SUM protocol, where $d$ is the parameter size of the training model. To address statistical diversity in practical applications of the new protocol, we develop the Moreau weighting method for its generalized weight matrix definition. This method, derived from the Moreau envelope, approximately optimizes the distance penalty of the Moreau envelope. We verify through deep learning experiments that the Adaptive Weighting Push-SUM protocol is more efficient in practice than the Push-SUM protocol.
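
For orientation, the baseline Push-SUM protocol that the abstract builds on can be sketched as follows. This is a minimal sketch of the standard Push-SUM averaging scheme on a directed graph, not the Adaptive Weighting variant proposed in the paper. The function name `push_sum`, the uniform splitting of mass among out-neighbors, and the adjacency-list input format are assumptions made for illustration.

```python
import numpy as np


def push_sum(values, out_neighbors, num_rounds):
    """Estimate the network-wide average with standard Push-SUM.

    values:        array of shape (N, d), one row of local values per node.
    out_neighbors: out_neighbors[j] lists the nodes that node j sends to
                   (node j itself is always added as a recipient).
    num_rounds:    number of communication rounds to run.
    Returns an (N, d) array holding each node's estimate of the average.
    """
    x = np.asarray(values, dtype=float).copy()  # Push-SUM numerators
    n = x.shape[0]
    w = np.ones(n)                              # Push-SUM weights (denominators)

    # Column-stochastic mixing matrix: node j splits its mass uniformly
    # among itself and its out-neighbors (an assumed, common choice).
    P = np.zeros((n, n))
    for j, targets in enumerate(out_neighbors):
        recipients = [j] + list(targets)
        for i in recipients:
            P[i, j] = 1.0 / len(recipients)

    for _ in range(num_rounds):
        x = P @ x  # each node sums the numerator shares it receives
        w = P @ w  # the weights are mixed with the same matrix

    # The ratio removes the bias caused by P not being row-stochastic.
    return x / w[:, None]
```

Because the mixing matrix is only column-stochastic, neither `x` nor `w` alone converges to the average on a directed graph; their ratio does when the graph is strongly connected. In the abstract's terms, the original protocol is the special case that the adaptive weight matrix generalizes.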
