Deep-Relative-Trust-Based Diffusion for Decentralized Deep Learning

Muyun Li, Aaron Fainman, Stefan Vlaski

TL;DR

This work tackles decentralized training of deep networks on non-IID data by shifting from parameter-space consensus to function-space consensus using Deep Relative Trust (DRT). By formulating a penalty based on output similarity and deriving layer-wise, time-varying mixing, the authors provide convergence guarantees for the network centroid and bounded disagreement under standard stochastic-gradient assumptions. Empirical results on CIFAR-10 with ResNet-20 show that DRT diffusion improves steady-state accuracy and reduces generalization gaps on sparse topologies, while maintaining convergence speed comparable to fast-mixing diffusion. The proposed approach highlights the potential of leveraging over-parameterization to promote function-level consensus, enabling robust decentralized learning in communication-constrained or irregular networks.

Abstract

Decentralized learning strategies allow a collection of agents to learn efficiently from local data sets without the need for central aggregation or orchestration. Current decentralized learning paradigms typically rely on an averaging mechanism to encourage agreement in the parameter space. We argue that in the context of deep neural networks, which are often over-parameterized, encouraging consensus of the neural network outputs, as opposed to their parameters, can be more appropriate. This motivates the development of a new decentralized learning algorithm, termed DRT diffusion, based on deep relative trust (DRT), a recently introduced similarity measure for neural networks. We provide convergence analysis for the proposed strategy, and numerically establish its benefit to generalization, especially with sparse topologies, in an image classification task.
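For orientation, the parameter-space averaging mechanism that the abstract contrasts against can be sketched as a standard adapt-then-combine diffusion step. This is a minimal illustration of the baseline, not the DRT diffusion algorithm itself. The function names, the ring topology with uniform 1/3 weights, and the quadratic local costs (a stand-in for non-IID local losses) are all assumptions made for the example.

```python
import numpy as np


def ring_combination_matrix(num_agents: int) -> np.ndarray:
    """Doubly stochastic weights for a ring (assumed topology).

    Each agent averages itself and its two ring neighbours with weight 1/3.
    Requires num_agents >= 3 so that the three indices are distinct.
    """
    A = np.zeros((num_agents, num_agents))
    for k in range(num_agents):
        for l in (k - 1, k, k + 1):
            A[l % num_agents, k] = 1.0 / 3.0
    return A


def diffusion_step(W, grads, A, step_size):
    """One adapt-then-combine step in parameter space.

    W and grads have shape (num_agents, dim); row k belongs to agent k.
    """
    psi = W - step_size * grads  # adapt: local gradient step at every agent
    return A.T @ psi             # combine: w_k = sum_l A[l, k] * psi_l


def run_demo(num_agents=16, dim=4, step_size=0.1, num_iters=500, seed=0):
    """Run diffusion on toy costs J_k(w) = 0.5 * ||w - t_k||^2.

    The targets t_k differ across agents, mimicking heterogeneous local data.
    """
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(num_agents, dim))
    A = ring_combination_matrix(num_agents)
    W = np.zeros((num_agents, dim))
    for _ in range(num_iters):
        grads = W - targets  # exact gradient of each local quadratic cost
        W = diffusion_step(W, grads, A, step_size)
    return W, targets, A


if __name__ == "__main__":
    W, targets, A = run_demo()
    centroid = W.mean(axis=0)
    print("distance of centroid to global minimizer:",
          np.linalg.norm(centroid - targets.mean(axis=0)))
    print("network disagreement:", np.linalg.norm(W - centroid))
```

In this toy setting the network centroid reaches the minimizer of the aggregate cost while the agents retain a small, bounded disagreement on the sparse ring. The combine step here mixes raw parameters with fixed weights, whereas the paper's approach, as summarized above, uses layer-wise, time-varying mixing derived from a function-space penalty.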

Paper Structure (13 sections, 5 theorems, 51 equations, 2 figures, 1 table)

This paper contains 13 sections, 5 theorems, 51 equations, 2 figures, 1 table.

Key Result

Lemma 1

Under Assumption assp: sc and the construction eqn:matrixCons, the graph represented by the weighted combination matrix $\bm{A}_{i}^{(p)} \triangleq \left[\bm{a}_{\ell k, i}^{(p)}\right]$ is compatible with the graph described by $C$ for all $p$ and all $i$ in the sense that: Moreover, for all $p$ and all $i$, the nonzero elements in the mixing matrices are lower bounded as follows: ∎

Figures (2)

  • Figure 1: Learning curves for a decentralized network with 16 agents, employing ResNet-20 on CIFAR-10 with non-IID data at each agent
  • Figure 2: Generalization gap for a decentralized network with 16 agents, employing ResNet-20 on CIFAR-10 with non-IID data at each agent

Theorems & Definitions (9)

  • Lemma 1: Graph-compatible $\bm{A}_{i}^{(p)}$
  • proof
  • Lemma 2: Time-varying weight vector [Tsitsiklis84]
  • Lemma 3: Network Disagreement
  • proof
  • Theorem 1: Descent Relation
  • proof
  • Lemma 4: Perturbation bounds
  • proof