Dual-Distilled Heterogeneous Federated Learning with Adaptive Margins for Trainable Global Prototypes

Fatema Siddika, Md Anwar Hossen, Wensheng Zhang, Anuj Sharma, Juan Pablo Muñoz, Ali Jannesari

TL;DR

The paper tackles heterogeneity in Federated Learning by introducing FedProtoKD, a framework that combines dual knowledge distillation with adaptive, class-wise trainable prototypes to prevent prototype margin shrinkage during aggregation. It employs a learnable projection to align heterogeneous feature spaces, a contrastive generator to synthesize server prototypes with per-class adaptive margins, and variance-weighted logit aggregation together with quality-aware prioritization of public data to distill robust global knowledge. The approach is validated on CIFAR-10/100 and Tiny-ImageNet across extreme and moderate non-IID settings, showing consistent improvements in server and client accuracy and robustness to model heterogeneity, data heterogeneity, and scale. Overall, FedProtoKD advances prototype-based HFL by maintaining discriminative class boundaries and efficient cross-model knowledge transfer while preserving client privacy.

Abstract

Heterogeneous Federated Learning (HFL) has gained significant attention for its capacity to handle both model and data heterogeneity across clients. Prototype-based HFL methods have emerged as a promising way to address statistical and model heterogeneity as well as privacy challenges. These methods share class-representative prototypes among heterogeneous clients. However, aggregating these prototypes via standard weighted averaging often yields sub-optimal global knowledge. Specifically, averaging shrinks the decision margins of the aggregated prototypes, thereby degrading model performance under model heterogeneity and non-IID data distributions. We propose FedProtoKD for the HFL setting, which uses an enhanced dual-knowledge distillation mechanism to improve system performance by leveraging clients' logits and prototype feature representations. The proposed framework aims to resolve the prototype margin-shrinking problem with a contrastive-learning-based trainable server prototype that uses a class-wise adaptive prototype margin. Furthermore, the framework assesses the importance of public samples by the closeness of each sample's prototype to its class-representative prototype, which enhances learning performance. FedProtoKD improved test accuracy by an average of 1.13% and up to 34.13% across various settings, significantly outperforming existing state-of-the-art HFL methods.
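
The margin-shrinking effect of weighted averaging can be seen in a toy example. The sketch below is only an illustration, not the paper's code: the two clients, their 2-D prototypes, the per-class sample counts, and the helper names `margin` and `weighted_average` are all assumptions. The margin is taken to be the minimum Euclidean distance between distinct class prototypes.

```python
import math

def margin(prototypes):
    """Smallest Euclidean distance between any two distinct class prototypes."""
    classes = sorted(prototypes)
    return min(
        math.dist(prototypes[a], prototypes[b])
        for i, a in enumerate(classes)
        for b in classes[i + 1:]
    )

def weighted_average(client_protos, client_counts):
    """Per-class mean of client prototypes, weighted by per-class sample counts."""
    aggregated = {}
    for c in client_protos[0]:
        total = sum(counts[c] for counts in client_counts)
        dim = len(client_protos[0][c])
        aggregated[c] = [
            sum(counts[c] * protos[c][d]
                for protos, counts in zip(client_protos, client_counts)) / total
            for d in range(dim)
        ]
    return aggregated

# Hypothetical clients: each separates the two classes well, but along
# different feature directions (a stand-in for model heterogeneity).
client_a = {0: [1.0, 0.0], 1: [-1.0, 0.0]}
client_b = {0: [0.0, 1.0], 1: [0.0, -1.0]}
# Hypothetical non-IID label distribution (samples per class on each client).
counts_a = {0: 90, 1: 10}
counts_b = {0: 10, 1: 90}

global_protos = weighted_average([client_a, client_b], [counts_a, counts_b])
best_local_margin = max(margin(client_a), margin(client_b))
global_margin = margin(global_protos)

print(f"best local margin: {best_local_margin:.3f}")
print(f"averaged margin:   {global_margin:.3f}")
```

In this toy setting both clients have a local margin of 2.0, while the averaged prototypes are only about 1.414 apart, so the aggregate is less separable than either client's own prototypes.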

Paper Structure

This paper contains 31 sections, 1 theorem, 16 equations, 8 figures, 11 tables, 1 algorithm.

Key Result

Proposition 1

Boundedness of Gradients under Adaptive Clipping: For the proposed class-wise adaptive margin loss, the magnitude of the gradient with respect to the generated prototype, $\nabla_{\tilde{P}^c} \mathcal{L}$, is strictly bounded, provided the adaptive margin $\xi^c(t)$ satisfies the

Figures (8)

  • Figure 1: Normalized prototype margins across CIFAR-10 classes (0–9); the concentric grid lines (0.0 to 1.0) represent the magnitude of the normalized margin. The prototype margin is defined as the minimum Euclidean distance separating distinct class prototypes, while the maximum margin denotes the highest value observed across all clients per class. The blue line represents the maximum margin observed among all local clients, and the red line depicts the aggregated prototype margin of FedProto, indicating the baseline margin shrinkage that occurs during standard aggregation. FedProtoKD generates global prototypes with enhanced inter-class separability, thereby maximizing geometric distinguishability and mitigating the shrinkage problem.
  • Figure 2: Overview of the FedProtoKD framework. (1) Client Knowledge Transfer: Clients compute local prototypes and logits using unlabeled public data and transmit them to the server. (2) Server Aggregation: The server employs a dual-branch strategy: the ACTP module synthesizes adaptive global prototypes via a non-linear generator, while the logit branch prioritizes high-quality samples based on prediction variance. These components are fused via Dual Knowledge Distillation. (3) Server Knowledge Transfer: The refined global knowledge is distilled back to clients to align local feature representations.
  • Figure 3: t-SNE visualization of local class prototypes after integrating global prototype updates, where each point represents the feature centroid learned by a client for a specific class. While Fed2PKD and FedTGP exhibit crowded clusters indicative of margin shrinkage, FedProtoKD maintains distinct separation. This confirms that the ACTP mechanism effectively enforces geometric separability, preventing margin collapse.
  • Figure 4: Higher intra-logit variance corresponds to a higher confidence score; the correlation of 0.9744 between the two supports the intuition behind variance-based logit aggregation.
  • Figure 5: The plot visualizes the L2 distance between public sample feature vectors and their corresponding global class prototypes $\tilde{P}^c$. Samples are categorized relative to the class mean distance, which is marked by a blue diamond. Green points represent high-quality samples located closer to the prototype, while red points indicate lower-quality samples situated further away. The significant variance observed within classes demonstrates that samples possess unequal reliability, motivating the proposed sample-wise importance weighting mechanism $\mathcal{I}_i$ for the Global Knowledge Distillation loss.
  • ...and 3 more figures
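
The ideas behind Figures 4 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the function names, the choice of weighting each client in proportion to the variance of its logit vector, and the `exp(-d / mean_distance)` form of the sample importance are all hypothetical. Only the general ideas (logit variance as a confidence proxy, L2 distance to the class prototype as a quality signal, categorization relative to the class mean distance) come from the captions.

```python
import math

def variance(values):
    """Population variance of one logit vector."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def aggregate_logits(client_logits):
    """Weighted mean of client logits for one public sample.

    Assumption: each client's weight is proportional to the variance of its
    own logit vector, used here as a stand-in for prediction confidence.
    """
    variances = [variance(logits) for logits in client_logits]
    total = sum(variances)
    n = len(client_logits)
    # Fall back to a plain mean if every client outputs flat logits.
    weights = [v / total for v in variances] if total > 0 else [1.0 / n] * n
    fused = [
        sum(w * logits[k] for w, logits in zip(weights, client_logits))
        for k in range(len(client_logits[0]))
    ]
    return fused, weights

def sample_importance(features, prototype):
    """Hypothetical importance weights from the L2 distance to a class prototype.

    Assumption: weight_i is proportional to exp(-d_i / mean_distance). Samples
    no farther than the class mean distance are flagged as high quality.
    """
    distances = [math.dist(f, prototype) for f in features]
    mean_d = sum(distances) / len(distances)
    scale = mean_d if mean_d > 0 else 1.0
    raw = [math.exp(-d / scale) for d in distances]
    norm = sum(raw)
    weights = [r / norm for r in raw]
    high_quality = [d <= mean_d for d in distances]
    return weights, high_quality

# One confident client (peaked logits) and one uncertain client (flat logits).
fused, client_weights = aggregate_logits([[4.0, 0.0, 0.0], [1.0, 1.0, 1.2]])

# Three public samples of one class at increasing distance from its prototype.
weights, high_quality = sample_importance(
    [[0.1, 0.0], [1.0, 0.0], [3.0, 0.0]], prototype=[0.0, 0.0]
)

print("client weights:", client_weights)
print("fused logits:  ", fused)
print("sample weights:", weights)
print("high quality:  ", high_quality)
```

With these inputs the confident client dominates the fused logits, and the sample weights decrease with distance from the prototype, so only the two nearer samples fall on the high-quality side of the class mean distance.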

Theorems & Definitions (2)

  • Proposition 1
  • proof