FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models

Annemette Brok Pirchert, Jacob Nielsen, Mogens Henrik From, Lukas Galke Poech, Peter Schneider-Kamp

TL;DR

FlexMoRE addresses data-governance constraints by enabling federated training with rank-heterogeneous experts, combining a shared public base with low-rank adapters or full-size specialists. Derived via post-hoc adapter extraction (PHLoRA), each domain expert augments the base with a rank-appropriate correction, allowing flexible routing and inference-time composition. Regression analyses reveal task-typical rank needs: reasoning-heavy benchmarks benefit from higher ranks while knowledge-driven tasks saturate earlier, yielding meaningful memory savings without sacrificing performance. Empirically, FlexMoRE matches or surpasses full-size MoE baselines while using roughly one third the parameters, enabling scalable, decentralized LLM specialization with practical impact for regulated domains.

Abstract

Recent advances in mixture-of-experts architectures have shown that individual expert models can be trained federatedly, i.e., in isolation from other experts, by using a common base model to facilitate coordination. However, we hypothesize that full-sized experts may not be necessary for all domains and that low-rank adapters may be sufficient instead. Here, we introduce FlexMoRE, a Flexible Mixture of Rank-heterogeneous Experts, in which each expert may be either a full-sized expert or an adapter of a suitable rank. We systematically investigate the trade-off between expert rank and downstream task performance by evaluating $6$ experts with ranks $2^0$ to $2^{14}$, resulting in experiments covering 150 mixtures (96 with 2 experts, 54 with 7 experts) that are evaluated across $120$ tasks. For our experiments, we build on FlexOlmo and turn its pre-trained experts into low-rank versions. Our regression analysis from expert rank to downstream task performance reveals that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks. These findings on rank sensitivity have direct implications for memory efficiency: using optimal ranks, FlexMoRE yields improved downstream task performance (average score $47.18$) compared to the baseline FlexOlmo-style mixture of full-sized experts (average score $45.46$) at less than one third of the parameters ($10.75$B for FlexMoRE vs. $33.27$B for FlexOlmo). All code will be made available.
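
To illustrate what turning a pre-trained expert into a low-rank version could look like, here is a minimal sketch. It assumes the adapter is obtained by a truncated SVD of the difference between the expert's weights and the shared base weights, which this page does not spell out. The function names `extract_low_rank_adapter` and `adapter_param_count` are hypothetical and are not taken from the FlexMoRE or FlexOlmo code.

```python
import numpy as np


def extract_low_rank_adapter(w_base, w_expert, rank):
    """Approximate (w_expert - w_base) by a rank-`rank` product B @ A.

    Assumption: post-hoc extraction via truncated SVD of the weight delta.
    """
    delta = w_expert - w_base
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, s.shape[0])
    b = u[:, :r] * s[:r]  # shape (out_dim, r), singular values folded into B
    a = vt[:r, :]         # shape (r, in_dim)
    return b, a


def adapter_param_count(out_dim, in_dim, rank):
    """Parameters stored by the adapter, versus out_dim * in_dim for a full matrix."""
    return rank * (out_dim + in_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_base = rng.normal(size=(64, 32))
    # Synthetic "expert" whose delta to the base has rank 4 by construction.
    w_expert = w_base + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
    b, a = extract_low_rank_adapter(w_base, w_expert, rank=4)
    error = np.linalg.norm(w_base + b @ a - w_expert)
    print(f"reconstruction error: {error:.2e}")
    print("adapter params:", adapter_param_count(64, 32, 4), "vs full:", 64 * 32)
```

At inference time, such an expert would be applied as the base weights plus the correction `b @ a`. Under this assumption, a lower rank stores fewer parameters at the cost of a coarser approximation of the original expert.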

Paper Structure (35 sections, 5 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: FlexMoRE follows a standard MoE architecture, similar to FlexOlmo, utilizing the domain-informed router but routing to one or more group(s) with a base expert and rank-heterogeneous experts.
  • Figure 2: Unweighted average performance of FlexMoRE models with $2$, $4$, and $7$ active experts across six benchmarks. The solid curve shows performance under homogeneous post-hoc LoRA rank tuning. Dashed horizontal lines correspond to heterogeneous FlexMoRE compositions, with the dotted line indicating experts selected based on performance across all benchmarks (All) and the dashed line experts selected using MC9 (MC9). The FlexOlmo baseline is shown for reference.
  • Figure 3: Typical log$_2$ LoRA rank at which experts achieve peak performance. For each expert $e$ and evaluation group $g$, the peak rank is computed directly from the observed scores after sorting by rank and resolving ties by selecting the lowest rank. Peak performance typically occurs at moderate ranks: for the aggregated average (Avg), the median peak is at $\log_2 r=9$ with IQR $[6.75,10.50]$ (i.e., ranks $\approx 2^7$–$2^{10}$). Knowledge-oriented benchmarks peak earlier (e.g., MMLU: median $\log_2 r=2$, IQR $[1.25,8.00]$; GEN5: median $\log_2 r=5$, IQR $[4.25,5.75]$), while reasoning-heavy benchmarks peak at substantially higher ranks (e.g., BBH: median $\log_2 r=11.5$, IQR $[7.25,12.00]$).
  • Figure 4: Performance across all six benchmarks illustrating expert specialization and rank sensitivity. AGIEval emphasizes knowledge-intensive and academic-style tasks, where the experts like perform strongly. BBH focuses on structured reasoning, favoring the Code and Math experts. GEN5 captures open-ended and generative abilities, where the Creative Writing and Reddit experts are most competitive. MC9 evaluates mixed-task performance and favors rank-efficient generalist experts with broad cross-domain utility.
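
The peak-rank statistic described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the helper names `peak_log2_rank` and `summarize_peaks` are hypothetical, scores are assumed to be stored per expert as a mapping from rank to score, and quartiles are assumed to use linear interpolation (NumPy's default).

```python
import math

import numpy as np


def peak_log2_rank(scores_by_rank):
    """Return log2 of the lowest rank that attains the maximum score.

    `scores_by_rank` maps a LoRA rank (a positive power of two) to the
    observed score for one expert on one evaluation group.
    """
    best = max(scores_by_rank.values())
    # Iterate in ascending rank order so that ties resolve to the lowest rank.
    for rank in sorted(scores_by_rank):
        if scores_by_rank[rank] == best:
            return math.log2(rank)


def summarize_peaks(peaks):
    """Median and interquartile range of per-expert peak log2 ranks."""
    q1, med, q3 = np.percentile(peaks, [25, 50, 75])
    return float(med), (float(q1), float(q3))


if __name__ == "__main__":
    # Made-up scores for a single expert; ranks 2 and 4 tie for the best score.
    toy_scores = {1: 0.40, 2: 0.50, 4: 0.50, 8: 0.30}
    print(peak_log2_rank(toy_scores))
    # Made-up peak values for several experts.
    print(summarize_peaks([2, 4, 6, 8]))
```

The numbers in the usage example are invented purely to exercise the tie-breaking rule and do not correspond to any result reported in the paper.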