On the effects of similarity metrics in decentralized deep learning under distributional shift

Edvin Listo Zec, Tom Hagander, Eric Ihre-Thomason, Sarunas Girdzijauskas

TL;DR

This work tackles the challenge of aggregation in decentralized learning under distributional shift by evaluating four similarity metrics to identify beneficial peers for model merging and by introducing FedSim, a similarity-weighted aggregation rule. The authors conduct extensive experiments across concept, covariate, domain, and label shifts on synthetic and real datasets, showing that cosine similarity on weights or gradients often yields robust peer selection and that FedSim can outperform traditional FedAvg, especially under strong cluster divergence. They also demonstrate that inverse empirical loss can be noisy and that the $L^2$ distance is generally weaker, while pre-training and clustering dynamics can complicate similarity signals. The findings provide practical guidance for designing robust, privacy-preserving decentralized learning systems and motivate future work on theory, privacy-preserving similarity measures, and scalable aggregation under non-iid data.
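As a rough illustration of the kinds of metrics named above, the following sketch computes cosine similarity, an $L^2$-based score, and an inverse-loss score on flattened vectors. The function names, the sign convention for the $L^2$ score (negated distance, so larger means more similar), the `eps` stabilizer, and the toy vectors are all assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flattened weight (or gradient) vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def l2_similarity(u, v):
    """Negated Euclidean distance, so larger means more similar (assumed convention)."""
    return -float(np.linalg.norm(u - v))

def inverse_loss_similarity(loss, eps=1e-12):
    """Inverse of the empirical loss a peer's model attains on local data."""
    return 1.0 / (loss + eps)

# Hypothetical flattened parameter vectors for one client and two peers.
w_local = np.array([1.0, 2.0, 3.0])
w_peer_a = np.array([2.0, 4.0, 6.0])     # same direction, different scale
w_peer_b = np.array([-1.0, -2.0, -3.0])  # opposite direction

cos_a = cosine_similarity(w_local, w_peer_a)
cos_b = cosine_similarity(w_local, w_peer_b)
l2_a = l2_similarity(w_local, w_peer_a)
l2_b = l2_similarity(w_local, w_peer_b)
inv_loss = inverse_loss_similarity(0.5)
```

The same `cosine_similarity` function applies to gradients by passing flattened gradient vectors instead of weights. Note that cosine similarity ignores scale: `w_peer_a` is far from `w_local` in Euclidean terms yet perfectly aligned with it.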

Abstract

Decentralized Learning (DL) enables privacy-preserving collaboration among organizations or users to enhance the performance of local deep learning models. However, model aggregation becomes challenging when client data is heterogeneous, and identifying compatible collaborators without direct data exchange remains a pressing issue. In this paper, we investigate the effectiveness of various similarity metrics in DL for identifying peers for model merging, conducting an empirical analysis across multiple datasets with distribution shifts. Our research provides insights into the performance of these metrics, examining their role in facilitating effective collaboration. By exploring the strengths and limitations of these metrics, we contribute to the development of robust DL methods.
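One way to picture similarity-driven peer selection and a similarity-weighted merge is sketched below. This is not the paper's FedSim rule, whose exact form is not given in this summary: the softmax normalization, the `temperature` parameter, the self-similarity of 1.0 assigned to the local model, and all function names are assumptions for illustration.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over similarity scores."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_peers(similarities, n_peers, rng, temperature=1.0):
    """Sample distinct peer indices with probability increasing in similarity."""
    probs = softmax(similarities, temperature)
    return rng.choice(len(probs), size=n_peers, replace=False, p=probs)

def uniform_merge(local_model, peer_models):
    """Plain averaging of the local model with its peers (FedAvg-style, equal weights)."""
    return np.mean(np.vstack([local_model, *peer_models]), axis=0)

def similarity_weighted_merge(local_model, peer_models, peer_sims, temperature=1.0):
    """Average models with weights given by a softmax over similarities (assumed rule)."""
    sims = np.concatenate(([1.0], peer_sims))  # assumed self-similarity of 1.0
    weights = softmax(sims, temperature)
    stacked = np.vstack([local_model, *peer_models])
    return weights @ stacked

rng = np.random.default_rng(0)
local = np.array([0.0, 0.0])
peers = [np.array([2.0, 2.0]), np.array([4.0, 4.0])]

chosen = sample_peers([0.9, -0.2, 0.5, 0.1], n_peers=2, rng=rng)
equal = similarity_weighted_merge(local, peers, np.array([1.0, 1.0]))
skewed = similarity_weighted_merge(local, peers, np.array([1.0, -1.0]), temperature=0.1)
```

With equal similarities the weighted merge reduces to the uniform average; with a dissimilar peer and a low temperature, that peer's contribution is almost entirely suppressed.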

Paper Structure (26 sections, 1 theorem, 11 equations, 10 figures, 8 tables, 1 algorithm)

Key Result

Proposition 1

Consider a single round $t$ in the batch stochastic gradient setting with learning rate $\eta$. Let each sampled client $i\in [m]$ compute its local update $w_{i, t} = w_t - \eta \nabla_w \hat{R}_i(w_t)$, where $\hat{R}_i(w_t)$ is the empirical risk on client $i$. Then, the aggregated update using t where $w_{t+1}^{P_k} = w_t - \eta \nabla_w \hat{R}_{P_k}(w_t)$ is the batch SGD update that would b
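The statement above is truncated in this extraction, so the following is only a numerical sanity check of a standard identity that it appears to build on: if each client takes one full-batch gradient step from the same $w_t$, then averaging the local models with weights proportional to local dataset sizes equals one gradient step on the pooled data. The squared loss, the size-proportional weights, and the synthetic data are assumptions for this sketch, not details confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 3, 0.1
w_t = rng.normal(size=d)  # shared starting point for the round

def grad_mse(w, X, y):
    """Gradient of the empirical risk (1 / (2n)) * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

# Synthetic clients with different local dataset sizes (hypothetical data).
sizes = [5, 8, 12]
clients = [(rng.normal(size=(n, d)), rng.normal(size=n)) for n in sizes]

# Local one-step updates: w_{i,t} = w_t - eta * grad R_i(w_t).
local_models = [w_t - eta * grad_mse(w_t, X, y) for X, y in clients]

# Size-weighted aggregation of the local models (assumed weighting).
weights = np.array(sizes) / sum(sizes)
aggregated = sum(a * w for a, w in zip(weights, local_models))

# One gradient step on the pooled data of all clients.
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
pooled = w_t - eta * grad_mse(w_t, X_all, y_all)

matches = bool(np.allclose(aggregated, pooled))
```

The equality holds because the pooled empirical risk is the size-weighted mean of the client risks, and the gradient is linear in that mean.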

Figures (10)

  • Figure 1: Illustrations of different types of machine learning paradigms. (a) In traditional ML, a model is trained on a centralized dataset. (b) In FL, a central server orchestrates the training of a global model across multiple private datasets. (c) In DL, there is no central server; instead clients communicate and merge models in a peer-to-peer network.
  • Figure 2: Heatmaps of client communication, indicating how often client $x$ communicated with client $y$ for the four different similarity metrics.
  • Figure 3: Test loss performance for all methods on the linear regression problem with concept shift.
  • Figure 4: Test accuracy for all methods on the domain shift problem, where half of clients have data from the MNIST dataset and the other half have data from the Fashion-MNIST dataset. Results for two different model architectures: (a) an MLP and (b) a CNN.
  • Figure 5: Test accuracy for all methods on different problems: Fashion-MNIST covariate shift (a), pre-trained CIFAR-100 label shift (b), and the CIFAR-10 label shifts (c,d).
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 1: Unbiased estimator
  • Proof: Proof sketch