On Sampling Strategies for Spectral Model Sharding

Denis Korzhenkov, Christos Louizos

TL;DR

This work presents two sampling strategies for spectral model sharding, each obtained as the solution to a specific optimization problem, and demonstrates that both can improve performance on various commonly used datasets.

Abstract

The problem of heterogeneous clients in federated learning has recently drawn a lot of attention. Spectral model sharding, i.e., partitioning the model parameters into low-rank matrices based on the singular value decomposition, has been one of the proposed solutions for more efficient on-device training in such settings. In this work, we present two sampling strategies for such sharding, obtained as solutions to specific optimization problems. The first produces unbiased estimators of the original weights, while the second aims to minimize the squared approximation error. We discuss how both of these estimators can be incorporated in the federated learning loop and practical considerations that arise during local training. Empirically, we demonstrate that both of these methods can lead to improved performance on various commonly used datasets.

Paper Structure

This paper contains 25 sections, 2 theorems, 26 equations, 3 figures, 5 tables, 2 algorithms.

Key Result

Theorem 3.1

For an unbiased estimator $\hat{W}$ of the type specified in the general estimator equation (eq:general_estimator) and consisting of $n$ terms, the Frobenius discrepancy can be expressed in terms of the marginal inclusion probabilities, and the optimal set of inclusion probabilities has an explicit form for $i=1,\dots,N$, where $t \in \{0,\dots,n-1\}$.
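
The idea behind an unbiased, fixed-size spectral estimator can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the estimator is assumed to keep $n$ of the $N$ rank-one SVD terms and rescale each kept term by the inverse of its inclusion probability, the probabilities are assumed to be proportional to the singular values and capped at 1, and systematic sampling is swapped in as one simple way to draw exactly $n$ terms with prescribed marginals. The function names are hypothetical.

```python
import numpy as np


def inclusion_probabilities(sigma, n):
    """Probabilities proportional to sigma, capped at 1, summing to n.

    Assumed rule for illustration; assumes positive singular values.
    """
    sigma = np.asarray(sigma, dtype=float)
    pi = np.zeros_like(sigma)
    free = np.ones(sigma.shape, dtype=bool)  # terms not yet capped at 1
    budget = n
    while True:
        pi[free] = budget * sigma[free] / sigma[free].sum()
        over = free & (pi > 1.0)
        if not over.any():
            return pi
        pi[over] = 1.0  # always include these terms
        free &= ~over
        budget = n - np.count_nonzero(~free)


def sample_indices(pi, rng):
    """Systematic sampling: draws round(sum(pi)) indices with P(i kept) = pi[i]."""
    edges = np.concatenate(([0.0], np.cumsum(pi)))
    n = int(round(edges[-1]))
    points = rng.uniform() + np.arange(n)  # one point per unit interval
    idx = np.searchsorted(edges, points, side="right") - 1
    return np.minimum(idx, len(pi) - 1)  # guard against rounding at the end


def low_rank_estimate(U, s, Vt, pi, idx):
    """Sum of the kept rank-one terms, each rescaled by 1 / pi for unbiasedness."""
    return (U[:, idx] * (s[idx] / pi[idx])) @ Vt[idx, :]


def expected_sq_error(s, pi):
    """Expected squared Frobenius error under the assumed estimator form."""
    return float(np.sum(s**2 * (1.0 / pi - 1.0)))


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
pi = inclusion_probabilities(s, n=2)
W_hat = low_rank_estimate(U, s, Vt, pi, sample_indices(pi, rng))
print("rank of one draw:", np.linalg.matrix_rank(W_hat))
print("expected squared error:", expected_sq_error(s, pi))
```

Under these assumptions the expected squared error depends only on the marginal inclusion probabilities, because the rank-one terms $u_i v_i^\top$ are orthonormal in the Frobenius inner product and the cross terms vanish. Averaging many independent draws of `W_hat` should therefore approach `W`.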

Figures (3)

  • Figure 1: Communication efficiency. The original PriSM method is too explorative (high ANME), while our 'unbiased' modification (+Wallenius) makes it the most exploitative strategy and achieves the best performance in some experiments with a limited computational budget.
  • Figure 2: Convergence analysis. When trained longer, the proposed strategies show a decrease in the cross-entropy loss of the global model on the training set. The Unbiased strategy reaches a training accuracy of $97.0\%$, and the Collective strategy achieves $98.4\%$. This serves as empirical evidence of convergence for our method.
  • Figure 4: Comparison with FedAvg weight updates. For a ResNet model trained on CIFAR-10, the updates produced by the Top-$n$ strategy deviate significantly from those of the FedAvg method. This correlates with the worse performance in this experiment.

Theorems & Definitions (7)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Remark B.1
  • Remark B.2
  • Remark B.3