UNIDEAL: Curriculum Knowledge Distillation Federated Learning
Yuwen Yang, Chang Liu, Xun Cai, Suizhi Huang, Hongtao Lu, Yue Ding
TL;DR
UNIDEAL tackles cross-domain Federated Learning with heterogeneous model architectures by decoupling model parameters and sharing only the task-head parameters, so each client can keep its own feature extractor. It introduces Adjustable Teacher-Student Mutual Evaluation Curriculum Learning (CLKD), which uses batch-wise mutual evaluation scores and a cosine-based similarity metric to progressively supervise local heads with a global teacher during knowledge distillation, while linearly decaying the training subset from easy to hard samples. Experiments on image and tabular cross-domain tasks show that UNIDEAL consistently surpasses state-of-the-art baselines in accuracy and communication efficiency, with the cosine-similarity variant of CLKD giving the strongest gains. The paper also extends the approach to heterogeneous architectures (UNIDEAL-HETE) and proves an $O(1/T)$ convergence rate in the non-convex setting, supporting its use for scalable, privacy-preserving collaborative learning.
Abstract
Federated Learning (FL) has emerged as a promising approach to enable collaborative learning among multiple clients while preserving data privacy. However, cross-domain FL tasks, where clients possess data from different domains or distributions, remain a challenging problem due to the inherent heterogeneity. In this paper, we present UNIDEAL, a novel FL algorithm specifically designed to tackle the challenges of cross-domain scenarios and heterogeneous model architectures. The proposed method introduces Adjustable Teacher-Student Mutual Evaluation Curriculum Learning, which significantly enhances the effectiveness of knowledge distillation in FL settings. We conduct extensive experiments on various datasets, comparing UNIDEAL with state-of-the-art baselines. Our results demonstrate that UNIDEAL achieves superior performance in terms of both model accuracy and communication efficiency. Additionally, we provide a convergence analysis of the algorithm, showing a convergence rate of O(1/T) under non-convex conditions.
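To make the parameter decoupling concrete, the following is a minimal sketch of head-only sharing, not the authors' implementation. It assumes that each client's parameters are stored in a dictionary, that head parameters are identified by a hypothetical `head.` key prefix, and that the server combines heads by a FedAvg-style average weighted by client data size. The aggregation rule is an assumption, since the text above only states that task-head parameters are shared.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_heads(
    client_params: List[Params],
    client_sizes: List[int],
    head_prefix: str = "head.",  # hypothetical naming convention
) -> Params:
    """Average only the head parameters, weighted by client data size.

    Feature-extractor entries are never read, so they may differ in
    shape (or architecture) from client to client.
    """
    total = sum(client_sizes)
    head_keys = [k for k in client_params[0] if k.startswith(head_prefix)]
    global_head: Params = {}
    for key in head_keys:
        acc = [0.0] * len(client_params[0][key])
        for params, n in zip(client_params, client_sizes):
            for i, value in enumerate(params[key]):
                acc[i] += (n / total) * value
        global_head[key] = acc
    return global_head


# Two clients with different extractor shapes but identical head shapes.
clients = [
    {"extractor.w": [1.0, 2.0, 3.0], "head.w": [0.0, 2.0]},
    {"extractor.w": [5.0], "head.w": [4.0, 6.0]},
]
global_head = average_heads(clients, client_sizes=[1, 3])
print(global_head)
```

Because the extractor entries never leave the client in this sketch, only the (typically much smaller) head is communicated, which is the property that allows per-client feature extractors.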
