FedDW: Distilling Weights through Consistency Optimization in Heterogeneous Federated Learning

Jiayu Liu, Yong Wang, Nianbin Wang, Jing Yang, Xiaohui Tao

TL;DR

FedDW targets the non-IID problem in federated learning by enforcing an IID-like class structure through consistency regularization between a global soft-label (SL) matrix and the class relations (CR) derived from the classifier weights. It defines DLE data, builds a global SL matrix, and regularizes each local classifier with a Frobenius distance that aligns CR with SL under heterogeneity, so that global aggregation guides the local updates. The authors provide a convergence analysis, discuss regularization properties and convex approximations, and report experiments on MNIST, CIFAR-10/100, and IMDB in which FedDW improves accuracy with negligible additional computation and communication while remaining compatible with existing FL methods. The results suggest that FedDW scales across client counts, training rounds, and model architectures, which makes it practical for large-scale heterogeneous FL. Overall, FedDW offers a principled and efficient mechanism for preserving global class relationships in distributed learning, improving generalization under non-IID data distributions.

Abstract

Federated Learning (FL) is a distributed machine learning paradigm that enables neural network training across devices without centralizing data. While this addresses issues of information sharing and data privacy, data heterogeneity across clients and increasing network scale affect model performance and training efficiency. Previous research shows that in IID environments, the parameter structure of the model is expected to adhere to certain consistency principles. Identifying and regularizing these consistencies can therefore mitigate the issues caused by heterogeneous data. We found that both the soft labels derived from knowledge distillation and the classifier head parameter matrix, when multiplied by their own transpose, capture the intrinsic relationships between data classes. These shared relationships suggest an inherent consistency. This paper identifies the consistency between the two and leverages it to regularize training, which underpins our proposed FedDW framework. Experimental results show that FedDW outperforms 10 state-of-the-art FL methods, improving accuracy by an average of 3% in highly heterogeneous settings. Additionally, we provide a theoretical proof that FedDW offers higher efficiency, with the additional computational load from backpropagation being negligible. The code is available at https://github.com/liuvvvvv1/FedDW.
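
The consistency idea described above can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that); it only assumes that the soft-label matrix is the per-class mean of softened predictions, that the class-relation matrix is a row-wise softmax of the classifier head multiplied by its own transpose, and that the two are compared with a Frobenius distance. The function names, the temperature parameter, the softmax normalization, and the toy values are all hypothetical choices made for illustration.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax of one row of scores."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_matrix(logits, labels, num_classes, temperature=1.0):
    """Row c = mean softened prediction over the samples labelled c."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for sample_logits, label in zip(logits, labels):
        probs = softmax(sample_logits, temperature)
        counts[label] += 1
        for j, p in enumerate(probs):
            sums[label][j] += p
    # Classes with no samples keep an all-zero row (an assumption of this sketch).
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else sums[c]
        for c in range(num_classes)
    ]


def class_relation_matrix(head_weights):
    """Row-wise softmax of W W^T; head_weights[c] is the weight vector of class c."""
    gram = [
        [sum(a * b for a, b in zip(w_i, w_j)) for w_j in head_weights]
        for w_i in head_weights
    ]
    return [softmax(row) for row in gram]


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two equal-shaped matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


# Toy example: 2 classes, 2-dimensional features.
logits = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
head_weights = [[1.0, 0.0], [0.0, 1.0]]

sl = soft_label_matrix(logits, labels, num_classes=2)
cr = class_relation_matrix(head_weights)
penalty = frobenius_distance(sl, cr)  # would be added to the local loss with a weight
```

In a real training loop the penalty would be computed with a tensor library so that gradients flow into the classifier head; plain Python is used here only to keep the sketch dependency-free.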

Paper Structure

This paper contains 22 sections, 27 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Schematic diagram of DLE data transmission in the federated system and three common types of deep encryption data.
  • Figure 2: The client uses DLE data for regular optimization, generating two types: $e_1$ for global aggregation to capture global information, and $e_2$ to guide optimization toward global generalization by aligning with $e_1$. $e_1$ must be chosen to ensure it retains generalization after aggregation. In particular, $e_1$ and $e_2$ may be equal.
  • Figure 3: The entire training process of FedDW.
  • Figure 4: We test each dataset 100 times. Due to space constraints, only three randomly selected cases (D1, D2, D3) are shown here with their modulus values. The size of each blue circle represents the proportion of data in that class: the larger the circle, the more data the class contains. A white circle means the class has no data. The number under each circle is the relative modulus of the weight vector of the corresponding class, computed for class $c_k$ as $\frac{\| \omega_{c_k} \|_2}{\sum_{i=1}^{|\mathbf{C}|} \| \omega_{c_i} \|_2}$. Red numbers mark counterexamples. In 100 experiments on three datasets, we found only 3 counterexamples.
  • Figure 5: We use "X" to mark the weight vector of each class in the classification layer; the midpoint of each "X" gives the position of that weight vector in the visualization space. Each weight vector falls within the cluster of its corresponding class. Note that the vectors must be normalized to unit length before visualization.
  • ...and 3 more figures
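
The relative modulus reported in the Figure 4 caption, $\frac{\| \omega_{c_k} \|_2}{\sum_{i=1}^{|\mathbf{C}|} \| \omega_{c_i} \|_2}$, can be computed directly from the classifier head. The following is a minimal sketch in plain Python, assuming the head is given as one weight vector per class; the function name `relative_norms` and the toy values are hypothetical and not taken from the paper.

```python
import math


def relative_norms(head_weights):
    """Return ||w_k||_2 / sum_i ||w_i||_2 for each class weight vector w_k."""
    norms = [math.sqrt(sum(v * v for v in w)) for w in head_weights]
    total = sum(norms)
    if total == 0.0:
        raise ValueError("all weight vectors are zero; relative norms are undefined")
    return [n / total for n in norms]


# Hypothetical 3-class head with 2-dimensional weight vectors.
shares = relative_norms([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
```

Because the values are normalized by the sum over all classes, they always add up to 1, which makes the shares comparable across different classifier heads.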

Theorems & Definitions (2)

  • Definition 1: DLE Data
  • Definition 2: DLE Consistency Optimization