From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik
TL;DR
DynaMoE is a post-training framework that converts a dense LLM into a token-difficulty-driven Mixture-of-Experts by partitioning MLP layers into nested FFN experts and training a lightweight token-difficulty router. Supervised by derived token-difficulty labels, the router dynamically sends each token to an appropriately sized expert, and a sensitivity parameter $\theta$ controls the balance between efficiency and accuracy. With only $10$B fine-tuning tokens and minimal router parameters, DynaMoE produces adaptive variants that match or approach the performance of more costly post-training methods such as Flextron while significantly reducing fine-tuning cost. The approach demonstrates robust token-difficulty discrimination, dynamic expert usage across layers, and meaningful efficiency-accuracy trade-offs, enabling deployment under diverse resource constraints.
Abstract
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10$B tokens, a minimal cost compared to the base model's training. Each variant offers a distinct trade-off between accuracy and efficiency. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of its fine-tuning cost.
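The routing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice is an assumption made for illustration: two nested experts that share the first hidden units of one FFN, a ReLU activation, a logistic router, randomly initialized weights, and a rule that sends a token to the large expert when its predicted difficulty is at least $\theta$. The names `nested_ffn`, `difficulty`, and `route` are hypothetical.

```python
import math
import random

random.seed(0)
D_MODEL, D_HIDDEN = 4, 8
# Assumed nested expert sizes: the small expert reuses the first 4 hidden units.
WIDTHS = (4, 8)

# Random stand-ins for the pre-trained FFN weights and the router (assumptions).
W_IN = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_HIDDEN)]
W_OUT = [[random.uniform(-1, 1) for _ in range(D_HIDDEN)] for _ in range(D_MODEL)]
ROUTER_W = [random.uniform(-1, 1) for _ in range(D_MODEL)]
ROUTER_B = 0.0


def nested_ffn(x, w_in, w_out, width):
    """Apply the FFN using only its first `width` hidden units (a nested expert)."""
    hidden = [
        max(0.0, sum(w_in[j][i] * x[i] for i in range(len(x))))  # ReLU, assumed
        for j in range(width)
    ]
    return [sum(w_out[k][j] * hidden[j] for j in range(width)) for k in range(len(w_out))]


def difficulty(x):
    """Hypothetical router: logistic score in (0, 1); higher means a harder token."""
    z = sum(w * xi for w, xi in zip(ROUTER_W, x)) + ROUTER_B
    return 1.0 / (1.0 + math.exp(-z))


def route(x, theta, widths=WIDTHS):
    """Use the large expert if the difficulty score is at least theta, else the small one."""
    width = widths[-1] if difficulty(x) >= theta else widths[0]
    return nested_ffn(x, W_IN, W_OUT, width), width


tokens = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(5)]
for theta in (0.0, 0.5, 1.1):
    used = [route(t, theta)[1] for t in tokens]
    print(f"theta={theta}: expert widths used per token = {used}")
```

Under this assumed rule, a threshold of 0 sends every token to the full-width expert, which recovers the dense layer. A threshold above 1 sends every token to the small expert, and values in between trade accuracy for compute token by token.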
