Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng

TL;DR

Mosaic addresses federated learning under simultaneous data and model heterogeneity. Each client trains a lightweight generator to synthesize privacy-preserving data, and class-specific client models are combined into a Mixture-of-Experts teacher. A prototype-informed meta model fuses the expert predictions, and knowledge distillation then transfers the collective knowledge to a global student using the generator ensemble. Generators are uploaded in one shot, which reduces communication and avoids unstable aggregation, while the ensemble-based teacher improves generalization across heterogeneous clients. On seven image-classification benchmarks, Mosaic achieves state-of-the-art performance under challenging heterogeneity regimes, with strong robustness and practical privacy advantages. The approach points toward privacy-preserving, scalable FL across diverse hardware environments.

Abstract

Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.
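The distillation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-class `gate_weights` array is a hypothetical stand-in for the meta model, the function names are invented for illustration, and the inputs are assumed to be expert logits already computed on synthetic samples.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_teacher_probs(expert_logits, gate_weights, temperature=1.0):
    # expert_logits: (k experts, n samples, c classes)
    # gate_weights:  (k, c) hypothetical per-expert, per-class weights
    w = gate_weights / gate_weights.sum(axis=0, keepdims=True)
    fused = (w[:, None, :] * expert_logits).sum(axis=0)  # shape (n, c)
    return softmax(fused, temperature)

def kd_loss(student_logits, teacher_probs, temperature=1.0):
    # Mean KL(teacher || student) over the batch of synthetic samples.
    log_s = np.log(softmax(student_logits, temperature) + 1e-12)
    log_t = np.log(teacher_probs + 1e-12)
    return float((teacher_probs * (log_t - log_s)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(3, 5, 4))   # 3 experts, 5 samples, 4 classes
gate_weights = np.ones((3, 4))               # uniform weights: plain averaging
teacher = moe_teacher_probs(expert_logits, gate_weights)
student_logits = rng.normal(size=(5, 4))
print(kd_loss(student_logits, teacher))
```

With uniform weights the teacher reduces to a plain logit average. Raising an expert's weight for the classes it specializes in lets that expert dominate the fused prediction for those classes, which is the intuition behind a class-specific Mixture-of-Experts teacher.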

Paper Structure

This paper contains 39 sections, 3 theorems, 29 equations, 23 figures, 15 tables, 3 algorithms.

Key Result

Theorem I.1

Under Assumptions 1 and 2, the expected prediction variance of the meta-enhanced ensemble is no greater than that of the vanilla ensemble, with equality if and only if all experts are homogeneous, i.e., $\sigma_1^2 = \dots = \sigma_k^2$ and $\delta_1 = \dots = \delta_k$.
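A classical analogue may help build intuition for the theorem, although it is not the paper's proof or its meta model. Assuming $k$ independent, unbiased experts with variances $\sigma_i^2$, uniform averaging has variance $\frac{1}{k^2}\sum_i \sigma_i^2$, while inverse-variance weighting has variance $1 / \sum_i \sigma_i^{-2}$, which is never larger and matches only when all variances are equal. The sketch below ignores the $\delta_i$ terms, whose definition is not given in this overview, and the function names are hypothetical.

```python
import numpy as np

def uniform_ensemble_variance(variances):
    # Variance of an equal-weight average of independent unbiased experts.
    v = np.asarray(variances, dtype=float)
    return float(v.sum() / len(v) ** 2)

def inverse_variance_ensemble_variance(variances):
    # Variance of the inverse-variance-weighted average (the minimum
    # achievable by any weights summing to one under independence).
    v = np.asarray(variances, dtype=float)
    return float(1.0 / (1.0 / v).sum())

heterogeneous = [1.0, 4.0, 9.0]
homogeneous = [2.0, 2.0, 2.0]

for name, v in [("heterogeneous", heterogeneous), ("homogeneous", homogeneous)]:
    print(name, uniform_ensemble_variance(v), inverse_variance_ensemble_variance(v))
```

For the heterogeneous experts the weighted variance is strictly smaller than the uniform one, while for the homogeneous experts the two coincide, mirroring the equality condition in the theorem.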

Figures (23)

  • Figure 1: The full workflow for Mosaic combined with a PT-based method. Mosaic consists of four stages: local model update, generator optimization, model aggregation, and knowledge distillation. Notably, during generator optimization, the local model is updated and produces $\mathcal{L}_{\text{adv}}$, while other losses $\mathcal{L}_{\text{entropy}}$, $\mathcal{L}_{\text{diversity}}$, and $\mathcal{L}_{\text{inversion}}$ are computed using a frozen local model fixed at initialization.
  • Figure 2: Visualization of synthetic data and decision boundaries of the global model $d_s$ and teacher ensemble $d_t$. Left panel: red circles indicate data synthesized by the aggregated global generator, while gray circles reflect areas the generator fails to cover due to instability. Middle panel: synthetic data produced by the generator ensemble, with most samples successfully synthesized. Right panel: dashed line denotes the original teacher ensemble, while the solid line represents the MoE-based teacher ensemble with broader decision boundaries.
  • Figure 3: Training dynamics of generators on clients with different data sample sizes.
  • Figure 4: Samples generated by generators trained on different clients.
  • Figure 5:
  • ...and 18 more figures

Theorems & Definitions (6)

  • Theorem I.1
  • proof
  • Lemma I.2
  • proof
  • Corollary I.3
  • proof