Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

Minhyuk Seo, Taeheon Kim, Hankook Lee, Jonghyun Choi, Tinne Tuytelaars

TL;DR

FedMosaic tackles data and model heterogeneity in personalized federated learning by coupling RELA, a relevance-guided, gradient-based aggregation scheme, with PQ-LoRA, a shareable, dimension-invariant adapter mechanism that enables knowledge transfer across heterogeneous architectures. It introduces DRAKE, a comprehensive multi-modal FL benchmark with task heterogeneity and distribution shifts to reflect real-world conditions. Empirically, FedMosaic achieves superior personalization and generalization across diverse heterogeneous setups, including cross-family model sharing and large-scale LLM scenarios, while maintaining manageable computation and communication costs through gradient sanitization and compressed PQ-LoRA updates. The work offers a practical pathway for deploying personalized, privacy-preserving multi-modal models in real-world, heterogeneous environments and provides a rich benchmark for future research in federated learning with foundation models.

Abstract

As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.

Paper Structure

This paper contains 75 sections, 2 theorems, 23 equations, 23 figures, 34 tables, 4 algorithms.

Key Result

Theorem 1

If the column vectors of matrix $B \in \mathbb{R}^{d_O \times r}$ are orthogonal and the row vectors of matrix $A \in \mathbb{R}^{r \times d_I}$ are orthogonal, then the span of the weight update space of PQ-LoRA, $\text{span}\{\Delta W\}$, has $r^2$ dimension, which is the maximum possible dimensio
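A small numeric check can make the orthogonality condition in Theorem 1 concrete. The sketch below is not the paper's proof. It assumes the weight update has the form $\Delta W = B P A$ with $A$ and $B$ frozen and only $P \in \mathbb{R}^{r \times r}$ varying, and it ignores $Q$. The matrices `A` and `B`, the sizes, and the helper names are all hypothetical, chosen only for illustration. Under that assumption, the $r^2$ updates produced by the unit matrices $E_{ij}$ are mutually orthogonal in the Frobenius inner product, so they are linearly independent and span an $r^2$-dimensional space.

```python
from itertools import product

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def frobenius_inner(X, Y):
    """Frobenius inner product: sum of elementwise products."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

r = 2
# Hypothetical frozen factors (illustrative values, not from the paper).
B = [[1, 1], [1, -1], [0, 0]]        # d_O x r = 3 x 2, columns are orthogonal
A = [[1, 0, 1, 0], [0, 1, 0, -1]]    # r x d_I = 2 x 4, rows are orthogonal

def delta_w(P):
    """Assumed update form: Delta W = B @ P @ A."""
    return matmul(matmul(B, P), A)

def unit(i, j):
    """r x r matrix with a single 1 at position (i, j)."""
    return [[1 if (a, b) == (i, j) else 0 for b in range(r)] for a in range(r)]

# One update per basis direction of P: r^2 matrices in total.
basis_updates = [delta_w(unit(i, j)) for i, j in product(range(r), repeat=2)]

# Gram matrix of those updates. If it is diagonal with a positive diagonal,
# the r^2 updates are linearly independent.
gram = [[frobenius_inner(U, V) for V in basis_updates] for U in basis_updates]
n = r * r
is_diagonal = all(gram[m][k] == 0 for m in range(n) for k in range(n) if m != k)
has_positive_diagonal = all(gram[m][m] > 0 for m in range(n))

print(is_diagonal and has_positive_diagonal)
```

If the columns of `B` (or the rows of `A`) are made linearly dependent instead, some basis updates coincide up to scale and the spanned dimension drops below $r^2$.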

Figures (23)

  • Figure 1: Overview of the heterogeneous personalized federated learning scenarios. $L_i$ refers to the local model for the $i$-th client. Clients focus on different tasks (i.e., data heterogeneity) where new data are encountered continuously. In addition to data heterogeneity, model architectures may differ across clients (i.e., model heterogeneity) due to differences in hardware constraints.
  • Figure 1: Comparison of FL benchmarks across key dimensions: Multi-Data Sources (using diverse datasets vs. non-i.i.d. splits of a single dataset), Distribution Shifts (evolving client data distributions), Multi-Image Support (handling multiple images per input), and Unseen Evaluation (testing on tasks unseen during training). See Sec. \ref{sec:appx:benchmark_comparison} for the detailed comparisons.
  • Figure 2: Overview of the proposed FedMosaic. In every round, the local PQ-LoRA $L_i$ fine-tuned during local training and the sanitized last-layer gradient $\tilde{g}_i$ are uploaded from the $i$-th client to the server. The last-layer gradient $g$ is extracted from the small pre-trained model $W_s$, then smoothed with an exponential moving average (EMA) to give $\hat{g}$, and finally compressed to $\tilde{g}$. Note that the gradient computation is performed every $m$ iterations. On the server, the sanitized gradients $\tilde{g}_i$ are used to measure client task relevance and to build a customized global PQ-LoRA $G_i$, which is distributed to the client and kept frozen. $h$ and $W_p$ denote the hidden state input and the pre-trained weight, respectively. $\beta$ is a learnable gating parameter that balances the output from the global and local models, and $|W_p|$ is the number of layers in the model.
  • Figure 3: Illustration of (a) Conventional LoRA and (b) PQ-LoRA. While $A$ and $B$ are trainable in conventional LoRA, PQ-LoRA freezes both, updating only the dimension-invariant modules $P \in \mathbb{R}^{r \times r}$ and $Q \in \mathbb{R}^r$ during training.
  • Figure 4: Layer-wise similarity between Llama-1B and Llama-3B measured with CKA. The brightest band runs along the diagonal, showing that alignment is strongest between layers at similar relative depths (e.g., Llama-1B layer 8 and Llama-3B layer 14).
  • ...and 18 more figures
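
The server-side flow described in the Figure 2 caption (EMA-smoothed gradients, a relevance score between clients, and a customized global adapter per client) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: it assumes relevance is the cosine similarity between gradient vectors, that the weights come from a temperature softmax over all clients including the client itself, and that adapters are flat parameter lists combined by a weighted sum. The function names, `decay`, `temperature`, and all numeric values are hypothetical, and the compression step is omitted.

```python
import math

def ema_update(ema, grad, decay=0.9):
    """Exponential moving average of a gradient vector (decay is an assumed value)."""
    return [decay * e + (1.0 - decay) * g for e, g in zip(ema, grad)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def relevance_weights(grads, i, temperature=0.5):
    """Assumed relevance rule: softmax over cosine similarity to client i."""
    scores = [cosine(grads[i], g) / temperature for g in grads]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def customized_global(adapters, weights):
    """Weighted sum of flat adapter parameter lists, one weight per client."""
    size = len(adapters[0])
    return [sum(w * a[k] for w, a in zip(weights, adapters)) for k in range(size)]

# Toy data: clients 0 and 1 have similar gradients, client 2 points the other way.
grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
adapters = [[1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]

w0 = relevance_weights(grads, 0)
G0 = customized_global(adapters, w0)
print(w0, G0)
```

In this toy setup the adapter of client 2, whose gradient opposes client 0's, receives the smallest weight in client 0's customized global adapter, which is the interference-reducing behavior the relevance weighting is meant to provide.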

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof