BTS: Harmonizing Specialized Experts into a Generalist LLM

Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis

TL;DR

BTS presents a modular approach to composing domain-specific LLM experts into a single generalist model by inserting and training lightweight stitch layers between a frozen seed LLM and frozen experts. The method, which alternates Experts-into-Hub and Hub-into-Experts stitching, preserves the integrity of the original models while enabling efficient cross-domain integration and easy addition/removal of experts. Across multiple benchmarks, BTS achieves the best average generalist performance among merging and upcycling baselines and even outperforms some domain experts on certain tasks. The work also provides extensive ablations and gate-value analyses demonstrating cross-capability emergence and interpretability, suggesting practical pathways for scalable, modular deployment of generalist LLMs.

Abstract

We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.

Paper Structure

This paper contains 48 sections, 4 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of the BTS algorithm. BTS operates in three phases. Different colors correspond to different expert domains. 1) Branch: Following Li et al. (2022), we begin with a pretrained seed model and create $N$ copies of it. 2) Train Experts: Each copy is independently pretrained on its respective data mixture, resulting in specialized expert models, as described in Li et al. (2022). 3) Stitching: Stitch layers are inserted throughout the layers, alternating between the Experts-into-Hub stitch layer and the Hub-into-Experts stitch layer. Only the stitch layers are updated during this training phase. The BTS model always has an Experts-into-Hub stitch layer as its last layer, because the hub output is returned as the final BTS output.
  • Figure 2: Visualization of how BTS gate values vary when generating a sequence during inference. We inspect the gate values for the last stitch layer over the course of a sequence. The first row plots the gate values for prompt tokens, while the second row plots the gate values for the generated tokens. Each column corresponds to a different prompt, sampled randomly from the corresponding benchmark task.
  • Figure 3: Visualization of the gate values of BTS's final stitch layer for context-switching sequences at inference time. These sequences are constructed by concatenating question-answer examples from Flores (3-shot), GSM8K (2-shot), and TriviaQA (2-shot), in that order, with dotted lines indicating task transitions. Each plot corresponds to a different randomly sampled prompt. This visualization highlights BTS's ability to dynamically adjust expert utilization based on token-level context.
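
The alternating stitch pattern in the Figure 1 caption and the per-token gate values in Figures 2 and 3 can be illustrated with a toy sketch. Only two things here come from the text above: stitch layers alternate between Experts-into-Hub and Hub-into-Experts, and the last one is always Experts-into-Hub. Everything else is an assumption made for illustration, not the paper's formulation: the softmax and sigmoid gating forms, gates computed from the hub state, and the hypothetical names `stitch_schedule`, `experts_into_hub`, `hub_into_experts`, and `gate_weights`. Hidden states are plain Python lists standing in for one token's activation vector.

```python
import math

E2H = "experts_into_hub"
H2E = "hub_into_experts"


def stitch_schedule(num_stitch_layers):
    """Alternate the two stitch kinds so that the last one is Experts-into-Hub."""
    return [E2H if (num_stitch_layers - 1 - i) % 2 == 0 else H2E
            for i in range(num_stitch_layers)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def experts_into_hub(hub, experts, gate_weights):
    """ASSUMED form: softmax gates over [hub] + experts, computed from the hub state.

    gate_weights holds one weight vector per source (hub first, then each expert).
    Returns the mixed hub vector and the gate values for this token.
    """
    sources = [hub] + experts
    gates = softmax([dot(w, hub) for w in gate_weights])
    mixed = [sum(g * s[d] for g, s in zip(gates, sources))
             for d in range(len(hub))]
    return mixed, gates


def hub_into_experts(hub, experts, gate_weights):
    """ASSUMED form: each expert blends in the hub state through a sigmoid gate.

    gate_weights holds one weight vector per expert.
    """
    updated = []
    for expert, w in zip(experts, gate_weights):
        g = 1.0 / (1.0 + math.exp(-dot(w, hub)))
        updated.append([(1.0 - g) * e + g * h for e, h in zip(expert, hub)])
    return updated


if __name__ == "__main__":
    hub = [1.0, 0.0]
    experts = [[0.0, 1.0], [2.0, 2.0]]
    for kind in stitch_schedule(3):
        if kind == E2H:
            hub, gates = experts_into_hub(hub, experts, [[0.0, 0.0]] * 3)
            print(kind, "gates:", gates)
        else:
            experts = hub_into_experts(hub, experts, [[0.0, 0.0]] * 2)
            print(kind)
```

In this sketch only the gate weights would be trainable, mirroring the caption's statement that only the stitch layers are updated. The gate values returned by `experts_into_hub` play the role of the per-token quantities plotted in Figures 2 and 3.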