Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
Tolga Dimlioglu, Anna Choromanska
TL;DR
This work tackles communication-efficient distributed training by revisiting the flat-minima hypothesis and introducing a practical sharpness proxy, Inverse Mean Valley (Inv. MV). A lightweight relaxation of this measure is added to the training objective as a regularizer, yielding Distributed Pull-Push Force (DPPF), which couples a pull toward consensus with a push that encourages workers to explore wide, flat valleys. The authors prove that the final valley width scales with the pull/push ratio $\lambda/\alpha$ and provide a PAC-Bayes generalization bound linked to valley width. Experiments across CNNs and ViTs show improved generalization and reduced communication relative to strong baselines, along with SAM-like performance. Overall, DPPF is a theoretically grounded and practically effective way to reach flat minima under communication constraints.
Abstract
We study centralized distributed data-parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple yet effective sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics are self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
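The interplay between the pull and push forces can be illustrated with a small toy simulation. The sketch below is not the authors' implementation: it omits local gradient steps entirely, and it assumes that $\alpha$ scales the pull toward the worker average while $\lambda$ scales a unit-norm push away from it, an assignment the text above does not spell out. The function names (`pull_push_step`, `mean_distance_to_average`, `simulate`) are hypothetical.

```python
import numpy as np


def pull_push_step(workers, alpha, lam, eps=1e-12):
    """One gradient-free pull-push round on an (M, d) array of worker parameters.

    Assumed update (not taken from the text): each worker moves toward the
    worker average with strength `alpha` and away from it along the unit
    direction with strength `lam`.
    """
    average = workers.mean(axis=0, keepdims=True)       # consensus point
    offsets = workers - average                          # worker minus average
    dists = np.linalg.norm(offsets, axis=1, keepdims=True)
    directions = offsets / np.maximum(dists, eps)        # unit vectors, guarded against 0/0
    pull = -alpha * offsets                              # toward the average
    push = lam * directions                              # away from the average
    return workers + pull + push


def mean_distance_to_average(workers):
    """Average distance of the workers from their mean (a simple spread measure)."""
    offsets = workers - workers.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(offsets, axis=1).mean())


def simulate(workers, alpha, lam, rounds):
    """Apply `rounds` pull-push steps and return the final worker array."""
    workers = np.asarray(workers, dtype=float)
    for _ in range(rounds):
        workers = pull_push_step(workers, alpha, lam)
    return workers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    init = rng.normal(size=(4, 10))   # 4 hypothetical workers, 10 parameters each
    final = simulate(init, alpha=0.1, lam=0.5, rounds=500)
    print("spread before:", mean_distance_to_average(init))
    print("spread after: ", mean_distance_to_average(final))
    print("lam / alpha:  ", 0.5 / 0.1)
```

Under these assumptions, a worker at distance $d$ from the average feels a pull of size $\alpha d$ and a push of size $\lambda$, so the two balance at $d = \lambda/\alpha$. With two workers the average stays fixed and the distance follows $d_{t+1} = (1-\alpha)\,d_t + \lambda$, which converges geometrically to that value. This is only a toy analogue of the self-stabilizing behavior described above.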
