PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

Yilun Liu, Yunpu Ma, Shuo Chen, Zifeng Ding, Bailan He, Zhen Han, Volker Tresp

TL;DR

This paper presents a unified framework for integrating PEFT modules directly into the MoE mechanism and introduces Parameter-Efficient Routed Fine-Tuning (PERFT), a flexible and scalable family of PEFT strategies tailored for MoE models.

Abstract

The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8$\times$7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.

Paper Structure

This paper contains 24 sections, 14 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Illustration of a default MoE layer and the PERFT family. PERFT-R, the primary variant, holds an independent routing among the introduced PEFT experts. PERFT-E embeds PEFT experts within the original MoE module and directly utilizes its routing patterns. PERFT-D and PERFT-S simply work as independent shared expert(s) alongside the MoE module.
  • Figure 2: The unified framework of PEFT for a MoE module. a. Functional strategies specify the internal implementation of the introduced PEFT module. b. Compositional strategies describe the PEFT module's interaction with the original MoE mechanism.
  • Figure 3: The dynamics between key memory vectors in experts and expert vectors in routers. a. A dense FFN expert, viewed as projecting $\bm{h}^t\in\mathbb{R}^D$ onto $D_a$ key memory vectors in the weight matrix $\bm{W}_\text{up}=\{\bm{k}_i\in\mathbb{R}^D\}$ and yielding activation scores $\bm{a}^t\in\mathbb{R}^{D_a}$ distributed over the key memories. b. A router for $N$ FFN experts, viewed as projecting $\bm{h}^t$ onto $N$ expert vectors stored in the router weight matrix $\bm{W}_g=\{\bm{g}_i\in\mathbb{R}^D\}$ and yielding token-to-expert affinity scores $\bm{s}^t\in\mathbb{R}^{N}$ distributed over the experts. Each expert vector $\bm{g}_i$ symbolizes a characteristic $\bm{h}^t$ pattern featuring its expert's key memory vectors $\{\bm{k}_j\}_i$. c. Routers for both the $N$ FFN experts and the $M$ PEFT experts introduce interesting dynamics between their expert vectors $\{\bm{g}_i\}$ and $\{\tilde{\bm{g}}_i\}$, resulting in a more flexible space for fine-tuning.
  • Figure 4: Performance comparison of OLMoE-1B-7B fine-tuned with baselines and the PERFT family. Performance on the $y$-axes is averaged across the corresponding evaluation benchmarks; "Activated Parameter Efficiency" on the $x$-axes indicates the ratio of activated trainable parameters to total activated parameters. Color represents different methods: "qvLoRA" stands for applying LoRA on attention matrices $\bm{W}_q$ and $\bm{W}_v$; "S", "D", "R" and "E" refer to the proposed PERFT variants. Transparency indicates different sparsity levels (ratio of activated experts $K/N$, labeled as "(TopK/N)" for PERFT-R and PERFT-E). Marker size indicates bottleneck size $D_B$.
  • Figure 5: Performance comparison of configurations with different total numbers of PEFT experts in PERFT-R. Results are from OLMoE-1B-7B fine-tuned with PERFT-R for commonsense reasoning. The $x$-axes show activated parameter efficiency. Transparency represents different sparsity levels (ratio of activated PEFT experts), and marker size represents bottleneck size $D_B$.
  • ...and 3 more figures
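
The captions for Figures 1 and 3 describe a router that projects each hidden state $\bm{h}^t$ onto expert vectors to obtain affinity scores, and a PERFT-R variant in which the PEFT experts have their own independent router. The NumPy sketch below illustrates that idea only; it is not the authors' implementation. Several details are assumptions that the captions do not specify: the softmax over affinities, the renormalization of the top-K gates, the LoRA-style low-rank form of each PEFT expert, and all names (`routed_adapters`, `W_g_tilde`, `A`, `B`).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_gates(scores, k):
    # Keep the k largest scores per token, zero the rest, then renormalize.
    # Renormalizing is an assumption; some MoE models skip this step.
    idx = np.argsort(scores, axis=-1)[:, -k:]
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    gated = scores * mask
    return gated / gated.sum(axis=-1, keepdims=True)

def routed_adapters(h, W_g_tilde, A, B, k):
    """Hypothetical PERFT-R-style update for a batch of tokens.

    h:         (T, D)      hidden states
    W_g_tilde: (D, M)      router weights, one expert vector per PEFT expert
    A:         (M, D, D_B) down-projections (bottleneck size D_B)
    B:         (M, D_B, D) up-projections
    Returns the update to add to the frozen MoE output, plus the gates.
    """
    s = softmax(h @ W_g_tilde)               # (T, M) token-to-expert affinities
    g = topk_gates(s, k)                     # (T, M) sparse gate weights
    z = np.einsum("td,mdb->tmb", h, A)       # project into each bottleneck
    y = np.einsum("tmb,mbd->tmd", z, B)      # project back to model width
    delta = np.einsum("tm,tmd->td", g, y)    # gate-weighted sum over experts
    return delta, g

# Example with made-up sizes: 5 tokens, D=16, M=4 PEFT experts, D_B=2, top-2.
rng = np.random.default_rng(0)
T, D, M, D_B, K = 5, 16, 4, 2, 2
h = rng.normal(size=(T, D))
delta, g = routed_adapters(
    h,
    rng.normal(size=(D, M)),
    rng.normal(size=(M, D, D_B)),
    rng.normal(size=(M, D_B, D)),
    K,
)
```

For clarity the sketch evaluates every PEFT expert and then masks with the gates; an efficient version would compute only the K selected experts per token. Under this reading, the sparsity level $K/M$ and the bottleneck size $D_B$ are the two knobs that the Figure 5 caption refers to.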