CoBo: Collaborative Learning via Bilevel Optimization

Diba Hashemi; Lie He; Martin Jaggi

TL;DR

We address collaborative learning with heterogeneous clients through a bilevel optimization framework that jointly selects collaborators and trains personalized models. The outer problem optimizes the personalized models, while the inner problem yields adaptive pairwise collaboration weights based on gradient alignment. CoBo, an SGD-style alternating algorithm, comes with convergence guarantees and scales to large numbers of clients, delivering up to a $9.3\%$ accuracy improvement on a highly heterogeneous 80-client task. Empirical results across cross-silo, cross-device, and language-model fine-tuning settings show competitive performance against strong personalized baselines and reveal interpretable, cluster-aware collaboration patterns.
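
The two levels described above can be written down compactly. The following is a sketch of one form consistent with this summary, not a verbatim quote of the paper's equations: the penalty weight $\rho$, the squared-distance coupling, and the box constraint $w_{ij}\in[0,1]$ are assumptions introduced here for illustration, while the use of gradients evaluated at the midpoint $\frac{1}{2}(\bm{x}_i+\bm{x}_j)$ follows the caption of Figure 1.

```latex
% Outer problem (assumed form): personalized models coupled by weighted proximity
\min_{\bm{x}_1,\dots,\bm{x}_n}\;
  \sum_{i=1}^{n} f_i(\bm{x}_i)
  + \frac{\rho}{2} \sum_{1 \le i < j \le n} w_{ij}\, \lVert \bm{x}_i - \bm{x}_j \rVert_2^2

% Inner problem (assumed form): weights reward gradient alignment at the midpoint
\max_{w_{ij} \in [0,1]}\;
  w_{ij}\, \Big\langle
    \nabla f_i\!\big(\tfrac{1}{2}(\bm{x}_i + \bm{x}_j)\big),\;
    \nabla f_j\!\big(\tfrac{1}{2}(\bm{x}_i + \bm{x}_j)\big)
  \Big\rangle
  \qquad \text{for all pairs } i \neq j
```

Under this reading, a pair whose midpoint gradients point in similar directions is pushed toward $w_{ij}=1$, and a pair whose gradients conflict is pushed toward $w_{ij}=0$.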

Abstract

Collaborative learning is an important tool for training multiple clients more effectively by enabling communication among them. Identifying helpful clients, however, is challenging and often introduces significant overhead. In this paper, we model client selection and model training as two interconnected optimization problems, proposing a novel bilevel optimization problem for collaborative learning. We introduce CoBo, a scalable and elastic SGD-type alternating optimization algorithm that efficiently addresses this problem with theoretical convergence guarantees. Empirically, CoBo achieves superior performance, surpassing popular personalization algorithms by 9.3% in accuracy on a task with high heterogeneity, involving datasets distributed among 80 clients.
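
To make the alternating scheme concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation: the function name `cobo_sketch`, the quadratic losses, the zero initialization, the step sizes `eta` and `gamma`, the penalty weight `rho`, and the use of full (noise-free) gradients are all assumptions made for illustration. Each round first updates the pairwise weights by projected gradient ascent on the midpoint gradient alignment, then takes one gradient step on each personalized model.

```python
import numpy as np


def cobo_sketch(targets, steps=300, eta=0.1, gamma=0.5, rho=1.0):
    """Toy alternating updates with f_i(x) = 0.5 * ||x - targets[i]||^2.

    All hyperparameters and the quadratic losses are illustrative assumptions.
    Returns the personalized models x and the collaboration weights w.
    """
    n, d = targets.shape
    x = np.zeros((n, d))          # personalized models, one row per client
    w = np.full((n, n), 0.5)      # pairwise collaboration weights in [0, 1]
    np.fill_diagonal(w, 0.0)      # a client does not collaborate with itself

    def grad(i, z):
        # Gradient of the toy quadratic loss of client i at point z.
        return z - targets[i]

    for _ in range(steps):
        # Inner step: projected gradient ascent on w_ij, driven by the
        # alignment of both clients' gradients at the midpoint of their models.
        for i in range(n):
            for j in range(i + 1, n):
                mid = 0.5 * (x[i] + x[j])
                align = grad(i, mid) @ grad(j, mid)
                w[i, j] = w[j, i] = np.clip(w[i, j] + gamma * align, 0.0, 1.0)

        # Outer step: local gradient plus a weighted pull toward collaborators.
        new_x = x.copy()
        for i in range(n):
            pull = sum(w[i, j] * (x[i] - x[j]) for j in range(n))
            new_x[i] = x[i] - eta * (grad(i, x[i]) + rho * pull)
        x = new_x

    return x, w


# Two clusters of two clients each; clients in a cluster share an optimum.
targets = np.array([[3.0, 3.0], [3.0, 3.0], [-3.0, -3.0], [-3.0, -3.0]])
models, weights = cobo_sketch(targets)
```

In this toy setting the learned `weights` matrix becomes block diagonal: pairs with a shared optimum keep weight 1, and pairs from different clusters drop to 0 because their midpoint gradients point in opposite directions.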

Paper Structure (21 sections, 4 theorems, 52 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Problem formulation
Outer problem: training personalized models.
Inner Problem: Finding Collaborators
Algorithm
Theoretical results
Experiments
Cross-silo federated learning experiment with 8 clients
Cross-device experiment with 80 clients
Collaborative fine-tuning on language models
Related Work
Personalized federated learning.
Federated Learning with Client Selection
Bilevel optimization and alternating optimization.
Conclusions
...and 6 more sections

Key Result

Theorem 1

Suppose the assumptions a:smoothness, a:noise-bound, a:global_minimum, a:collaborative, and a:cluster hold. Suppose that CoBo solves eq:w_ with mini-batch size $b$. Consider clients $i$ and $j$ in the same cluster $\mathcal{C}$ of size $c$. Suppose that $M_{ij}^2\in(0,\frac{1}{5})$ and that $b$ satisfies a lower bound beginning $b\ge \frac{2}{c^2}2L$ [...]. The consensus distance also converges to 0. Moreover, the gradient norm is upper bounded.

Figures (6)

Figure 1: Diagram of the inner problem (eq:inner), represented through a contour of $\frac{1}{2}(f_1+f_2)$. The blue arrows $\rightarrow$ are gradients computed at the midpoint $\frac{1}{2}(\bm{x}_1+\bm{x}_2)$ to determine connectivity. The red arrows $\rightarrow$ represent gradients computed at the local models to update model weights.
Figure 2: (subfig:cross-silo-ablation) Average accuracy in cross-silo experiments with varying factors, including the fraction of the dataset available to clients, the number of clusters, and the number of clients per cluster. (subfig:acc8) Average accuracy of personalized models for cross-silo federated learning with 8 clients. The "Oracle" denotes applying FedAvg to the clients with the same label permutation.
Figure 3: Collaboration matrices learned by Federated Clustering (FC), IFCA, and CoBo at different stages of training for the cross-silo experiment with 8 clients. The diagonals are masked out. The oracle matrix is a block diagonal matrix with blocks of size 2. The collaboration matrix of CoBo starts to resemble the oracle matrix in as few as 300 iterations (0.75% of the total iterations) and converges to it within 5000 iterations (12.5% of the total iterations). In contrast, IFCA yields a fully connected matrix, while FC occasionally diverges from the cluster structures it has found (e.g., at iterations 300, 5000, and 40000), even at the end of training.
Figure 4: Domain weights found by CoBo for the Catalan language. There are 4 domains in total: Catalan, Spanish, German, and Dutch. The curves are smoothed by an exponential moving average.
Figure 5: Collaboration matrices learned by CoBo at different stages of training for the cross-device experiment with 80 clients. The diagonals are masked out. The oracle matrix is a block diagonal matrix consisting of 10 blocks: two blocks of size 10, two blocks of size 9, and so on. The collaboration matrix of CoBo starts to resemble the oracle matrix in as few as 300 iterations (1.5% of the total iterations).
...and 1 more figure
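
The oracle matrices referred to in the captions of Figures 3 and 5 can be reconstructed from the stated block sizes. The sketch below is illustrative only: the helper `oracle_matrix` is a hypothetical name, and reading "and so on" as block sizes 10, 10, 9, 9, 8, 8, 7, 7, 6, 6 is an assumption, chosen because those ten blocks sum to the 80 clients of the cross-device experiment.

```python
import numpy as np


def oracle_matrix(block_sizes):
    """Block-diagonal 0/1 collaboration matrix with a masked (zero) diagonal."""
    n = sum(block_sizes)
    m = np.zeros((n, n), dtype=int)
    start = 0
    for size in block_sizes:
        # Clients inside the same block (cluster) collaborate with each other.
        m[start:start + size, start:start + size] = 1
        start += size
    np.fill_diagonal(m, 0)  # the figures mask out the diagonal
    return m


# Cross-silo experiment: 8 clients in blocks of size 2.
cross_silo = oracle_matrix([2] * 4)

# Cross-device experiment (assumed sizes): 10, 10, 9, 9, 8, 8, 7, 7, 6, 6.
cross_device_sizes = [s for s in range(10, 5, -1) for _ in range(2)]
cross_device = oracle_matrix(cross_device_sizes)
```

A learned collaboration matrix can then be compared against these references entry by entry, for example by thresholding the learned weights at 0.5.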

Theorems & Definitions (10)

Remark 1: Extensions
Example 2
Theorem 1
Corollary 2
Lemma 3
proof
Lemma 4
proof
proof
proof
