On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists

Dongyang Fan; Bettina Messmer; Nikita Doikov; Martin Jaggi

TL;DR

CoMiGS tackles on-device, privacy-preserving language model personalization under data and resource heterogeneity by introducing a bi-level Mixture-of-Experts framework that shares generalist knowledge across users while localizing specialist expertise. A token-level router, trained on separate validation data, dynamically assigns tokens to generalists or specialists, with the expert updates driven by on-device training data; alternating minimization solves the bi-level objective and yields linear convergence under mild conditions. Empirically, CoMiGS demonstrates strong performance across in- and out-of-distribution tasks on multilingual and domain-diverse corpora, outperforming FedAvg and other MoE baselines, especially when routing decisions are made at the token level. The framework decouples data quantity from resource availability, enabling high-resource users to leverage larger local models while protecting low-resource users from overfitting, with manageable on-device overhead and half the communication costs. The work also provides theoretical convergence guarantees and open-sources the codebase to foster collaborative, privacy-preserving LLM development on edge devices.
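
To make the token-level routing idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a linear router followed by a softmax and top-k selection; the function name `route_tokens`, the weight matrix `W_router`, and the toy experts are hypothetical names introduced only for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(h, W_router, experts, top_k=1):
    """Send each token to its top-k experts (illustrative assumption).

    h        : (T, d) array of token hidden states
    W_router : (d, E) router weights, one column per expert
    experts  : list of E callables mapping a (d,) vector to a (d,) vector
    Returns the mixed outputs (T, d) and the chosen expert indices (T, top_k).
    """
    scores = softmax(h @ W_router)                    # (T, E) expert scores per token
    top = np.argsort(-scores, axis=1)[:, :top_k]      # highest-scoring experts first
    out = np.zeros_like(h, dtype=float)
    for t in range(h.shape[0]):
        w = scores[t, top[t]]
        w = w / w.sum()                               # renormalize over the selected experts
        for weight, e in zip(w, top[t]):
            out[t] += weight * experts[e](h[t])
    return out, top

# Toy usage: expert 0 plays the generalist, expert 1 the specialist.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3))
W = np.array([[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]])
mixed, choice = route_tokens(hidden, W, [lambda x: 2 * x, lambda x: -x])
```

With `top_k=1` each token is processed by exactly one expert, which matches the Top1 coloring described in the Figure 3 caption; larger `top_k` values blend several experts per token.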

Abstract

On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate private learning with scarce data, Federated Learning has become a standard approach. However, it faces challenges such as computational resource heterogeneity and data heterogeneity among end users. We propose CoMiGS (Collaborative learning with a Mixture of Generalists and Specialists), the first approach to address both challenges. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is optimized using a separate validation set to ensure alignment with the target distribution. We solve our objective with alternating minimization, for which we provide a theoretical analysis. Our method shares generalist experts across users while localizing a varying number of specialist experts, thereby adapting to users' computational resources and preserving privacy. Through extensive experiments, we show that CoMiGS effectively balances general and personalized knowledge for each token generation. We demonstrate that CoMiGS remains robust against overfitting, owing to the generalists' regularizing effect, while adapting to local data through specialist expertise. We open-source our codebase for collaborative LLMs.
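
The alternating scheme described above can be sketched on a toy regression problem. This is a simplified illustration under stated assumptions, not the paper's algorithm: each user has one linear generalist and one linear specialist mixed by a sigmoid gate, updates are single gradient steps, the router step is taken before the expert step, and the generalist is averaged across users after every round. All names (`comigs_style_round`, `make_toy_users`, `w_gen`, `w_spec`, `phi`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w_gen, w_spec, phi):
    gate = sigmoid(X @ phi)                      # per-sample weight on the generalist
    return gate * (X @ w_gen) + (1.0 - gate) * (X @ w_spec), gate

def mse(X, y, w_gen, w_spec, phi):
    pred, _ = predict(X, w_gen, w_spec, phi)
    return 0.5 * np.mean((pred - y) ** 2)

def expert_grads(X, y, w_gen, w_spec, phi):
    # Gradients of the loss with respect to the generalist and the specialist.
    pred, gate = predict(X, w_gen, w_spec, phi)
    r = (pred - y) / len(y)
    return X.T @ (r * gate), X.T @ (r * (1.0 - gate))

def router_grad(X, y, w_gen, w_spec, phi):
    # Gradient of the loss with respect to the router weights.
    pred, gate = predict(X, w_gen, w_spec, phi)
    r = (pred - y) / len(y)
    return X.T @ (r * (X @ w_gen - X @ w_spec) * gate * (1.0 - gate))

def comigs_style_round(users, w_gen, lr=0.1):
    """One communication round; returns the new shared generalist."""
    local_gens = []
    for u in users:
        Xv, yv = u["val"]
        Xt, yt = u["train"]
        # Router step: uses the user's validation split only.
        u["phi"] = u["phi"] - lr * router_grad(Xv, yv, w_gen, u["w_spec"], u["phi"])
        # Expert step: uses the user's training split only.
        g_gen, g_spec = expert_grads(Xt, yt, w_gen, u["w_spec"], u["phi"])
        u["w_spec"] = u["w_spec"] - lr * g_spec  # specialist stays on the device
        local_gens.append(w_gen - lr * g_gen)    # generalist update is sent to the server
    return np.mean(local_gens, axis=0)           # server averages the generalists

def make_toy_users(rng, n_users=3, d=4, n=40):
    # Hypothetical data: a shared linear signal plus a per-user shift.
    shared = rng.normal(size=d)
    users = []
    for _ in range(n_users):
        local = shared + 0.5 * rng.normal(size=d)
        X = rng.normal(size=(2 * n, d))
        y = X @ local + 0.1 * rng.normal(size=2 * n)
        users.append({"train": (X[:n], y[:n]), "val": (X[n:], y[n:]),
                      "w_spec": 0.1 * rng.normal(size=d), "phi": np.zeros(d)})
    return users

rng = np.random.default_rng(0)
users = make_toy_users(rng)
w_gen = np.zeros(4)
for _ in range(20):
    w_gen = comigs_style_round(users, w_gen)
```

Only the generalist weights leave the device in this sketch; the specialist and router weights are updated in place on each user's dictionary, mirroring the shared/local split the abstract describes.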

Paper Structure (53 sections, 5 theorems, 53 equations, 19 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Collaborative Fine-Tuning for LLMs.
Mixture of Global and Local Experts.
Method
Notions and Problem Setup
A Bi-Level Formulation
Our Algorithm
Alternating Update of ${\mathbf{\Theta}}$ and ${\mathbf{\Phi}}$.
Convergence Results
Experiments
Setup
Datasets
In-Distribution Tasks.
Out-of-Distribution Tasks.
...and 38 more sections

Key Result

Theorem 3.1

If the fixed-point and contraction assumptions (Assumption-Fixed, Assumption-Contraction) hold and $\lambda_1 \cdot \lambda_2 < 1$, then the weights $({\mathbf{\Theta}}_k, {\mathbf{\Phi}}_{k})$ generated by the alternating upper-level and lower-level updates converge to $({\mathbf{\Theta}}^\star, {\mathbf{\Phi}}^\star)$ ...
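
Why the product $\lambda_1 \cdot \lambda_2$ is the quantity that matters can be seen on a scalar toy problem. The sketch below is an illustration only and does not reproduce the paper's assumptions: it uses two hypothetical affine update maps, $\Theta \leftarrow a\Phi + b$ and $\Phi \leftarrow c\Theta + d$, whose Lipschitz constants are $|a|$ and $|c|$. One map may even be expansive, as long as $|a| \cdot |c| < 1$.

```python
def alternate(a, b, c, d, phi0, steps):
    """Alternate two affine updates and track the distance to the fixed point.

    Theta <- a * Phi + b    (Lipschitz constant |a|, playing the role of lambda_1)
    Phi   <- c * Theta + d  (Lipschitz constant |c|, playing the role of lambda_2)
    Requires a * c != 1 so that the fixed point is well defined.
    """
    theta_star = (a * d + b) / (1.0 - a * c)   # solves theta = a * (c * theta + d) + b
    phi_star = c * theta_star + d
    phi, theta, errs = phi0, None, []
    for _ in range(steps):
        theta = a * phi + b
        phi = c * theta + d
        errs.append(abs(phi - phi_star))
    return theta, phi, (theta_star, phi_star), errs

# |a| = 2 is expansive on its own, but |a| * |c| = 0.6 < 1.
theta, phi, (theta_star, phi_star), errs = alternate(2.0, 1.0, 0.3, -1.0, phi0=5.0, steps=30)
ratios = [errs[k + 1] / errs[k] for k in range(5)]   # each ratio is close to 0.6
```

In this toy case the error in $\Phi$ shrinks by exactly the factor $|a| \cdot |c|$ per round, which is the linear rate one would expect when the composed update is a contraction.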

Figures (19)

Figure 1: Chat box between two users with different characteristics. Next word prediction for smart keyboards should be tailored to users' topic preferences for personalization. However, to ensure factual accuracy and linguistic consistency, the results of next word prediction should maintain universality.
Figure 2: Diagram of our proposed method CoMiGS, illustrated with a simplified setup of two heterogeneous models (corresponding to the two users in Figure 1). Generalist experts (${\boldsymbol{\theta}}^{G}_1$, ${\boldsymbol{\theta}}^{G}_2$) are aggregated across users, while specialist experts ($\{{\boldsymbol{\theta}}^{S_i}_1\}_{i=1}^3$, $\{{\boldsymbol{\theta}}^{S_1}_2\}$) and routers (${\boldsymbol{\phi}}_1$, ${\boldsymbol{\phi}}_2$) are kept local.
Figure 3: Visualization of in-distribution token-level routing results for CoMiGS-1G1S trained on SlimPajama. Tokens are colored with the Top1 expert choice at the first layer (top) and last layer (bottom). Orange denotes the generalist and blue denotes the specialist. Texts are generated by ChatGPT. Further colored text plots are provided in the appendix on expert specialization.
Figure 4: Expert Scores for the generalist expert and the specialist expert, averaged across all tokens and multiple batches for the out-of-distribution task (AG News). X-axis: number of iterations. Top: CoMiGS-1G1S, Bottom: pFedMoE. Darker colors indicate deeper layers.
Figure 5: Test Perplexity vs. the number of iterations. Low and high denote data quantity. Legend denotes $n_i$.
...and 14 more figures

Theorems & Definitions (9)

Theorem 3.1: Convergence under Contraction
Theorem 3.2: Global Convergence for Linear Experts
Remark 1
Theorem 6.1: Theorem \ref{['Theorem-Convergence']}
Example 1
Example 2
Proposition 1
proof
Theorem 6.2
