KALAVAI: Predicting When Independent Specialist Fusion Works -- A Quantitative Model for Post-Hoc Cooperative LLM Training

Ramchand Kumaresan

Abstract

Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 × divergence - 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero.

In the KALAVAI protocol, contributors independently fine-tune copies of a shared checkpoint, then submit them for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (±0.02%, 3 seeds), +7.49% at 1B (±0.01%, 3 seeds), and +6.53% at 6.9B, each measured over the best specialist. The router matches domain-oracle routing to within <10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling from 41.9 to 7.7. A 20-contributor federation achieves +16.71% (±0.07pp, 3 seeds).

Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging underperforms the best specialist by 1.2%, while any trained router achieves oracle-optimal assignment.
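The linear model quoted in the abstract can be applied directly as a back-of-the-envelope estimate. The sketch below is a minimal illustration, not the authors' code: the function names (`predicted_gain`, `break_even_divergence`, `in_fit_range`) are hypothetical, and the coefficients and the 3-26% fit range are copied from the abstract as stated. With only n=6 fitted conditions, the output is best read as rough guidance.

```python
# Coefficients as reported in the abstract (gain and divergence both in percent).
SLOPE = 0.82
INTERCEPT = -2.72
FIT_RANGE = (3.0, 26.0)  # divergence range covered by the fit, in percent


def predicted_gain(divergence_pct: float) -> float:
    """Estimated fusion gain (%) over the best specialist for a given mean divergence (%)."""
    return SLOPE * divergence_pct + INTERCEPT


def break_even_divergence() -> float:
    """Divergence (%) at which the linear model predicts zero gain."""
    return -INTERCEPT / SLOPE


def in_fit_range(divergence_pct: float) -> bool:
    """True if the divergence lies inside the range the regression was fit on."""
    low, high = FIT_RANGE
    return low <= divergence_pct <= high


if __name__ == "__main__":
    print(f"break-even divergence: {break_even_divergence():.2f}%")
    for div in (2.0, 10.0, 20.0):
        note = "" if in_fit_range(div) else " (outside fit range, extrapolated)"
        print(f"divergence {div:.1f}% -> predicted gain {predicted_gain(div):+.2f}%{note}")
```

Setting the predicted gain to zero gives a break-even divergence of 2.72 / 0.82 ≈ 3.3%, which matches the threshold quoted in the abstract.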

Paper Structure (96 sections, 4 equations, 15 figures, 27 tables)

Introduction
The core insight.
The protocol.
Key results.
Contributions.
Related Work
Branch-Train-Mix (BTX).
MoErging and PHATGOOSE.
Pari thesis.
Weight interpolation methods.
Multilingual cooperative training.
Federated learning.
FuseLLM.
Sparse Upcycling.
STAR and related concurrent work.
...and 81 more sections

Figures (15)

Figure 1: LoRA rank ablation: fusion gain vs. best specialist (%) and mean specialist divergence from base (%) across fine-tuning methods (Pythia-410M, seed 42, 2,000 training steps). No LoRA rank produces sufficient divergence ($\geq$3.3%) for positive fusion gain. Full fine-tuning achieves $+15.65\%$ divergence and $+7.72\%$ fusion gain.
Figure 2: Kalavai core results. (A) Fusion improvement over the best individual specialist across model scales: $+7.72\%$ at 410M, $+7.49\%$ at 1B, $+6.53\%$ at 6.9B (per-domain equal-weight eval). Gains are proportional to specialist divergence; conversion rate 0.49$\times$ at 410M/1B, 0.75$\times$ at 6.9B. (B) Training duration crossover: freeze=0 peaks at 5k steps ($+17.7\%$) then degrades to $+14.7\%$ at 20k steps; freeze=4 degrades more slowly ($+17.0\%$ at 20k); crossover at $\approx$10k steps. (C) Router architecture: uniform routing (no training) achieves $-1.2\%$ vs. best specialist; trained linear or MLP routers achieve $+7.7\%$; architecture is irrelevant, learning is not. (D) Kalavai vs. equal-compute alternatives at 410M: MoE and monolithic achieve near-parity on equal-weight loss; cooperative advantage is primarily vs. best individual specialist ($+7.72\%$). All results seed 42 or means over 3 seeds where noted.
Figure 3: Cross-lingual fusion results: base model perplexity vs. specialist vs. Kalavai MoE (Pythia-410M, seeds 137/2026). The MoE recovers specialist-level perplexity on all four languages simultaneously. Yoruba improvement 5.4$\times$ (PPL 41.9$\to$7.7); Welsh 4.6$\times$ (102.7$\to$22.1). Improvement annotations show the base$\to$MoE ratio. Seeds 137 and 2026 achieved perfect routing; seed 42 had router collapse (see text).
Figure 4: Fusion gain vs. mean specialist divergence (%) with OLS regression line and 95% prediction band. Linear fit: $\text{gain} = -2.72 + 0.82 \times \text{div}$ ($R^2 = 0.856$, $n=6$ in-sample conditions). English-domain conditions (Qwen, Pythia-6.9B/1B/410M) cluster near the line; Exp 2 (private, purple) and Exp 1 (cross-lingual, red) both lie above the English-domain prediction, consistent with base-model incompetence on target domains producing outsized gains. The cross-lingual condition is the largest in-sample outlier (+3.6pp). Annotations show gain/divergence conversion rate per condition. Note: Exp 3 (20-contributor, div = 15.68%, gain = +16.71%, 3-seed mean) is an out-of-sample validation point lying +6.57pp above the regression line (Table \ref{tab:divergence_gain}); it is not shown in this figure as the regression was fit before Exp 3 results were available.
Figure 5: Base-model perplexity as a secondary predictor of cooperative fusion efficiency. Left: Conversion efficiency (gain / divergence) versus mean base-model perplexity per condition. Centre: Same with log-scaled perplexity axis (Pearson $r = +0.560$). Right: Divergence versus gain coloured by base-model PPL quartile. Cross-lingual conditions (high base PPL) convert divergence most efficiently; English-domain conditions (low base PPL) sit near the baseline conversion rate. Dashed lines are OLS fits; $n=6$ conditions.
...and 10 more figures
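The conversion rate in the Figure 2 caption and the out-of-sample residual in the Figure 4 caption can be reproduced with simple arithmetic. The sketch below assumes the conversion rate is gain divided by mean divergence, as in the Figure 5 caption. The helper names are hypothetical, and the inputs are the values reported in the captions for Figures 1 and 4.

```python
# Regression coefficients from the Figure 4 caption (percent units).
SLOPE, INTERCEPT = 0.82, -2.72


def conversion_rate(gain_pct: float, divergence_pct: float) -> float:
    """Fraction of specialist divergence converted into fusion gain."""
    return gain_pct / divergence_pct


def residual(gain_pct: float, divergence_pct: float) -> float:
    """Observed gain minus the gain predicted by the linear fit (percentage points)."""
    return gain_pct - (SLOPE * divergence_pct + INTERCEPT)


# Pythia-410M, full fine-tuning (Figure 1): divergence 15.65%, gain +7.72%.
print(round(conversion_rate(7.72, 15.65), 2))  # 0.49

# Exp 3, 20-contributor federation (Figure 4 note): divergence 15.68%, gain +16.71%.
print(round(residual(16.71, 15.68), 2))  # 6.57
```

Both values agree with the captions: a 0.49× conversion rate at 410M, and Exp 3 lying about 6.57pp above the regression line.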
