Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Zihan Wang; Deli Chen; Damai Dai; Runxin Xu; Zhuoshu Li; Y. Wu

TL;DR

This work investigates parameter-efficient fine-tuning for sparse MoE LLMs and identifies strong task-specific expert specialization as a key driver of performance. It introduces Expert-Specialized Fine-Tuning (ESFT), which selectively tunes task-relevant experts while freezing the others, achieving results comparable or superior to full fine-tuning (FFT) while substantially reducing training cost. The authors show that MoE models with finer-grained experts make it easier to select relevant expert combinations, boosting both efficiency and effectiveness. Extensive experiments on a DeepSeek-V2-Lite backbone demonstrate ESFT's strong performance across enhancement and adaptation tasks and its better retention of general ability relative to FFT and LoRA. The work provides practical insights into expert routing, specialization, and the importance of training non-shared parameters within MoE architectures.

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture. Our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning (ESFT), which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning and find that MoE models with finer-grained experts are better at selecting the combination of experts most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.
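The idea of tuning only the most task-relevant experts can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes relevance is measured as each expert's share of the total gate value over sampled task tokens, and that experts are selected greedily until a cumulative `threshold` is reached. The function names, the `threshold` parameter, and the toy gate values are all hypothetical.

```python
from typing import List, Sequence


def expert_relevance(gate_values: Sequence[Sequence[float]]) -> List[float]:
    """Share of total gate value per expert over sampled tokens.

    gate_values[t][e] is the gate value token t assigns to expert e
    (0.0 when the token is not routed to that expert). Assumed scoring rule.
    """
    num_experts = len(gate_values[0])
    totals = [0.0] * num_experts
    for token_gates in gate_values:
        for e, g in enumerate(token_gates):
            totals[e] += g
    norm = sum(totals)
    return [t / norm for t in totals]


def select_experts(relevance: Sequence[float], threshold: float) -> List[int]:
    """Greedily pick top-scoring experts until their cumulative relevance
    reaches `threshold` (a hypothetical hyperparameter)."""
    order = sorted(range(len(relevance)), key=lambda e: relevance[e], reverse=True)
    chosen: List[int] = []
    cumulative = 0.0
    for e in order:
        chosen.append(e)
        cumulative += relevance[e]
        if cumulative >= threshold:
            break
    return sorted(chosen)


def trainable_mask(num_experts: int, chosen: Sequence[int]) -> List[bool]:
    """True for experts to fine-tune; False for experts that stay frozen."""
    chosen_set = set(chosen)
    return [e in chosen_set for e in range(num_experts)]


# Toy data: 3 sampled tokens, 4 experts in one MoE layer.
gates = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.8, 0.2, 0.0, 0.0],
]
scores = expert_relevance(gates)
selected = select_experts(scores, threshold=0.8)
mask = trainable_mask(len(scores), selected)
```

In a real model the mask would be applied per layer, for example by enabling gradients only for the selected experts' parameters and leaving every other module frozen.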

Paper Structure (40 sections, 8 equations, 8 figures, 11 tables)

Introduction
Related Work
Parameter-efficient fine-tuning for dense architectural LLMs
Coarse- and Fine-grained MoE LLMs
Methods
Preliminaries: Mixture-of-Experts for Transformers
Probing Task-Specific Expert Specialization in MoE Models
Expert Routing is Concentrated in the Same Task
Active Experts Vary Significantly across Tasks
Expert-Specialized Fine-tuning (ESFT)
Data Sampling
Expert Relevance Score
Expert Selection and Fine-tuning
Experiment Setup
Main Evaluation
...and 25 more sections

Figures (8)

Figure 1: Comparison between Expert-Specialized Fine-Tuning (ESFT) and other fine-tuning methods. FFT trains all parameters. LoRA combines pre-trained weights with low-rank matrices to reduce training costs. ESFT trains only a subset of experts in a Mixture-of-Experts (MoE) architecture, optimizing efficiency and task specialization.
Figure 2: Top Expert distribution for specific tasks. Shaded areas represent variance across layers. The figure shows that few experts handle most gate values, highlighting expert specialization for different tasks.
Figure 3: The average number of shared Top-6 routed experts across tasks. The values are averaged by layer, indicating that the sets of experts used for the same task are consistent while different tasks are distinct.
Figure 4: Number of experts trained in ESFT across layers and tasks. Layers computed earlier have smaller indices. Most tasks and layers train 5-15% of experts, demonstrating ESFT's effectiveness in selecting task-related experts.
Figure 5: Computational efficiency results. Blue bars show the training time and green lines show storage space. ESFT performs efficiently in terms of training time and storage space.
...and 3 more figures
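The overlap statistic described in the Figure 3 caption can be approximated with a short sketch. It assumes each task is summarized by a per-layer list of average gate values per expert, which is an assumption about the input format and not something the captions specify. The helper names and toy numbers are hypothetical.

```python
from typing import Sequence, Set


def top_k_experts(avg_gates: Sequence[float], k: int = 6) -> Set[int]:
    """Indices of the k experts with the highest average gate value."""
    order = sorted(range(len(avg_gates)), key=lambda e: avg_gates[e], reverse=True)
    return set(order[:k])


def mean_shared_top_k(
    task_a: Sequence[Sequence[float]],
    task_b: Sequence[Sequence[float]],
    k: int = 6,
) -> float:
    """Number of shared top-k experts between two tasks, averaged over layers.

    task_x[layer][expert] holds the average gate value for that expert.
    """
    overlaps = [
        len(top_k_experts(a, k) & top_k_experts(b, k))
        for a, b in zip(task_a, task_b)
    ]
    return sum(overlaps) / len(overlaps)


# Toy data: 2 layers, 4 experts, k=2 in place of the Top-6 used in the figure.
math_task = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.2, 0.3, 0.4]]
code_task = [[0.1, 0.6, 0.25, 0.05], [0.4, 0.3, 0.2, 0.1]]
cross_task_overlap = mean_shared_top_k(math_task, code_task, k=2)
same_task_overlap = mean_shared_top_k(math_task, math_task, k=2)
```

A high same-task value alongside a low cross-task value is the pattern the caption describes: expert sets are consistent within a task and distinct between tasks.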
