BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, Acyr Locatelli

TL;DR

BAM addresses the inefficiency of prior MoE upcycling by fully leveraging dense seed models, incorporating both FFN and attention parameters into expert modules. It introduces a parallel attention transformer and soft routing for attention experts, plus KV sharing variants to balance accuracy and inference efficiency. Across seeds from 590M to 2B parameters, BAM consistently surpasses the BTX baseline on perplexity and downstream tasks under equivalent data and compute, demonstrating the practical value of upcycling attention as well as FFN components. This approach enables more effective domain specialization while maintaining throughput via parallel computation, offering a scalable path for deploying Mixture of Experts in large-scale language models.

Abstract

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward networks (FFNs) to initialize the MoE's experts while merging all other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages of "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also fully leveraging the experts' attention parameters, initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from the dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts for better inference efficiency. To further improve efficiency, we adapt a parallel attention transformer architecture to MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
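
As a rough sketch of how the pieces named in the abstract could fit together, the block below writes one mixture layer in the parallel attention form. The notation is assumed here for illustration and is not taken from the paper: $x$ is a token's hidden state, $\mathrm{LN}$ is layer normalization, $\mathrm{Attn}_i$ and $\mathrm{FFN}_i$ are the attention and FFN experts initialized from dense expert $i$, and $W^{a}$, $W^{f}$ are hypothetical router matrices. "Soft" is read as a dense weighting over all $N$ attention experts, which is an interpretation of the abstract's wording.

```latex
\begin{aligned}
h     &= \mathrm{LN}(x), \\
g^{a} &= \mathrm{softmax}(W^{a} h), \qquad g^{f} = \mathrm{softmax}(W^{f} h), \\
y     &= x \;+\; \sum_{i=1}^{N} g^{a}_{i}\,\mathrm{Attn}_{i}(h) \;+\; \sum_{i=1}^{N} g^{f}_{i}\,\mathrm{FFN}_{i}(h).
\end{aligned}
```

Because both sums read the same input $h$, the attention and FFN experts have no dependency on each other and can be evaluated concurrently. Whether the FFN gate $g^{f}$ is sparsified (for example, keeping only the largest entries) is not stated in this excerpt. In the KV-sharing variant, the $\mathrm{Attn}_i$ would keep separate query (and possibly output) projections while using one shared set of key and value projections.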

Paper Structure (31 sections, 4 equations, 1 figure, 10 tables)


Figures (1)

  • Figure 1: BAM operates in three phases. Different colors correspond to different expert domains, which indicates the pre-trained seed model. White indicates random parameter initialization, and gradient color indicates parameter merging. 1) Branching: Begin with a pre-trained dense seed model and create $N$ copies of it. 2) Continued Pre-training: Continue to pre-train each copy independently on its own data mixture. This process yields specialized dense expert models. 3) Mixture Model Training: Utilize these specialized dense expert models to initialize both the FFN and attention experts of the mixture model. The router layers are initialized randomly. All other parameters are derived by averaging the corresponding layers in each of the dense experts. Note that BAM employs a parallel attention transformer architecture that concurrently computes attention experts and FFN experts. The figure is loosely based on Figure 1 of Sukhbaatar et al. (2024).
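
The initialization rule in phase 3 of the caption can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for the example, not the authors' code: the function name `upcycle`, the flat `layerK.attn.*` / `layerK.ffn.*` naming scheme, parameters stored as plain lists of floats, and the router's size and Gaussian initialization scale. It covers only the variant with separate attention experts, not the KV-sharing variant.

```python
import random


def upcycle(dense_experts, d_model, seed=0):
    """Build mixture-model parameters from N dense expert parameter dicts.

    Hypothetical convention: a name like "layer0.attn.wq" or "layer0.ffn.w1"
    is an expert parameter; any other name is a shared parameter.
    """
    rng = random.Random(seed)
    n = len(dense_experts)
    mixture = {}
    layers = set()

    for name in dense_experts[0]:
        tensors = [expert[name] for expert in dense_experts]
        parts = name.split(".")
        if len(parts) >= 2 and parts[1] in ("attn", "ffn"):
            # Attention and FFN weights: keep one copy per dense expert.
            layers.add(parts[0])
            suffix = ".".join(parts[2:])
            for i, tensor in enumerate(tensors):
                mixture[f"{parts[0]}.{parts[1]}.expert{i}.{suffix}"] = list(tensor)
        else:
            # All other parameters: element-wise average across experts.
            mixture[name] = [sum(vals) / n for vals in zip(*tensors)]

    # Routers have no counterpart in the dense models, so they start random.
    # The shape (n * d_model) and the 0.02 scale are illustrative assumptions.
    for layer in sorted(layers):
        for kind in ("attn", "ffn"):
            mixture[f"{layer}.{kind}.router"] = [
                rng.gauss(0.0, 0.02) for _ in range(n * d_model)
            ]
    return mixture


# Toy usage with two "experts" that each hold a handful of scalar weights.
experts = [
    {"embed": [1.0, 3.0], "layer0.attn.wq": [1.0], "layer0.ffn.w1": [2.0], "layer0.norm": [1.0, 1.0]},
    {"embed": [3.0, 5.0], "layer0.attn.wq": [10.0], "layer0.ffn.w1": [20.0], "layer0.norm": [3.0, 1.0]},
]
mix = upcycle(experts, d_model=2)
```

In the toy usage, `mix["embed"]` is the average of the two embeddings, while `mix["layer0.attn.expert0.wq"]` and `mix["layer0.attn.expert1.wq"]` retain each expert's own attention weight unchanged.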