MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling

Rachel S. Y. Teo, Tan M. Nguyen

TL;DR

MoLEx targets the high cost of adapting large pre-trained language models. It upcycles every layer of the backbone into a sparse mixture of layer experts: the existing layers are reused as experts, and a learnable gate with a top-K router selects which layer to mix with the current one, with the mixture weighted by a coefficient $\alpha$. This gives conditional computation and inter-layer information exchange with minimal parameter overhead. The authors provide a robustness theory based on linear ensembles, probe the linguistic information exchanged between mixed layers, and report improved accuracy and transferability on GLUE, E2E, and zero-shot tasks; the code is public. Because layer experts can be processed in parallel, the added computational overhead is minimal, and fine-tuning performance improves without increasing the effective parameter count. The method also scales to larger models.

Abstract

Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.

Paper Structure

This paper contains 34 sections, 7 theorems, 18 equations, 3 figures, 13 tables.

Key Result

Lemma 1

Consider a prediction function $f$, classifier head $H$, data point $({\bm{x}},y) \in ({\bm{X}}, {\bm{Y}})$ and a perturbed point $\tilde{{\bm{x}}} \in B({\bm{x}}, \epsilon)$. If $F({\bm{x}}) = H(f({\bm{x}}))=y$, then $F$ is $\epsilon$-Robust at ${\bm{x}}$ if and only if

Figures (3)

  • Figure 1: (a) A naive parameter-efficient fine-tuning model with $T$ layers $u_0$, $u_1$, $\cdots$, $u_{T-1}$ and input $z_0$; $z_t$, for $t=1,2,\cdots,T$, is the output of each layer. (b) A MoLEx model obtained from the same model; here $z_t$, for $t=1,2,\cdots,T$, is the output of each MoLEx layer. At each layer, a gate $g$ processes the layer input to select the top-1 layer expert, and the outputs of the layer and of the selected expert are linearly combined with weights $\alpha$ and $1-\alpha$ respectively. In the diagram, at layer $u_1$ the gate chooses layer $u_{T-1}$ for mixing, so the outputs of $u_1$ and $u_{T-1}$ are multiplied by $\alpha$ and $1-\alpha$ respectively and then summed.
  • Figure 2: Heat maps showing the percentage of time each layer expert is chosen at every layer of MoLEx when fine-tuning RoBERTa-base on the GLUE tasks CoLA, STS-B, and RTE. As one expert is fixed to be the original layer, the x-axis corresponds to the sequential layer while the y-axis corresponds to the layer experts.
  • Figure 3: Heat maps showing the percentage of time each layer expert is chosen at every layer of MoLEx when fine-tuning RoBERTa-base on all GLUE tasks. As one expert is fixed to be the original layer, the x-axis corresponds to the sequential layer while the y-axis corresponds to the layer experts. The darker a square is, the more often that layer is chosen by the gate during inference. For example, when fine-tuning on CoLA, layer 9 mixes with layer 2 100% of the time. The grids are partitioned into thirds along the x-axis and y-axis for easy visualization of early, middle, and later layers.
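
The mixing rule described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the affine stand-in layers, the linear gate, the use of one gate per layer, and the names `make_affine_layer`, `gate_scores`, `molex_layer`, and `molex_forward` are all assumptions made for illustration. The only parts taken from the caption are the top-1 selection of a layer expert and the combination of the two outputs with weights $\alpha$ and $1-\alpha$.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_affine_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a pre-trained layer u_t (elementwise affine map)."""
    def layer(z: Vector) -> Vector:
        return [scale * v + shift for v in z]
    return layer


def gate_scores(z: Vector, gate_weights: Sequence[Sequence[float]]) -> List[float]:
    """Assumed linear gate: one score per layer expert (row of weights dotted with z)."""
    return [sum(w * v for w, v in zip(row, z)) for row in gate_weights]


def molex_layer(z: Vector, t: int, layers: Sequence[Layer],
                gate_weights: Sequence[Sequence[float]],
                alpha: float) -> Tuple[Vector, int]:
    """Mix layer t with the top-1 layer expert chosen by the gate."""
    scores = gate_scores(z, gate_weights)
    k = max(range(len(layers)), key=lambda i: scores[i])  # top-1 expert index
    own = layers[t](z)      # output of the original (fixed) layer u_t
    expert = layers[k](z)   # output of the selected layer expert u_k
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(own, expert)]
    return mixed, k


def molex_forward(z0: Vector, layers: Sequence[Layer],
                  gates: Sequence[Sequence[Sequence[float]]],
                  alpha: float) -> Tuple[Vector, List[int]]:
    """Run all T layers in order; also return which expert each layer mixed with."""
    z, chosen = list(z0), []
    for t in range(len(layers)):
        z, k = molex_layer(z, t, layers, gates[t], alpha)
        chosen.append(k)
    return z, chosen


# Toy example: three layers acting on 2-dimensional inputs.
layers = [make_affine_layer(2.0, 0.0),   # u_0: z -> 2z
          make_affine_layer(1.0, 1.0),   # u_1: z -> z + 1
          make_affine_layer(-1.0, 0.0)]  # u_2: z -> -z
gates = [[[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],   # gate at layer 0
         [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]],   # gate at layer 1
         [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]   # gate at layer 2
z_out, chosen = molex_forward([1.0, 2.0], layers, gates, alpha=0.5)
print(z_out, chosen)  # [0.25, 0.25] [1, 2, 0]
```

Setting `alpha=1.0` in this sketch discards the selected expert and recovers the plain sequential model of panel (a). Counting the entries of `chosen` over many inputs gives the kind of per-layer selection frequencies that Figures 2 and 3 visualize.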

Theorems & Definitions (12)

  • Definition 1: $\epsilon$-Robustness
  • Definition 2: Linear MoLEx as an Ensemble Model
  • Lemma 1: Robustness condition for classifier model
  • Theorem 1: Linear ensembles are more robust
  • Corollary 1: Sufficient conditions for $\epsilon$-robustness
  • Corollary 2: Linear MoLEx is more robust than sequential model
  • Theorem 1: Linear ensembles are more robust than base models
  • proof
  • Corollary 1: Sufficient conditions for $\epsilon$-robustness
  • proof
  • ...and 2 more