LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji, Yuanyuan Shi, Fei Miao

TL;DR

LD-MoLE introduces a differentiable dynamic routing mechanism for Mixture of LoRA Experts, replacing fixed TopK routing with Sparsegen-based routing controlled by a token-specific sparsity parameter $\lambda$ predicted by a lightweight MLP. An analytical sparsity loss directly regulates the number of activated experts, enabling adaptive, token- and layer-wise expert allocation while keeping the router end-to-end differentiable. On Llama-3.2-3B and Qwen3-1.7B, LD-MoLE achieves the highest average scores among the compared baselines on instruction-tuning and sequence classification benchmarks, with stable training and efficient routing. The work demonstrates the value of learnable routing for PEFT-MoE setups and points to future directions in pretraining and multi-modal integration.

Abstract

Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines across a diverse set of benchmarks. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
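
For contrast with the dynamic routing proposed here, the fixed-budget behaviour of conventional TopK routing can be sketched as follows. This is only an illustration under an assumption: it uses one common TopK variant (softmax restricted to the K largest logits), and the function name `topk_routing` is hypothetical rather than taken from the paper. Whatever the logits look like, exactly K experts receive non-zero weight, and the index selection itself has no gradient.

```python
import math

def topk_routing(u, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    # Indices of the k largest scores (the hard, non-differentiable step).
    top = sorted(range(len(u)), key=lambda i: u[i], reverse=True)[:k]
    m = max(u[i] for i in top)                      # subtract max for stability
    exps = {i: math.exp(u[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(u))]

# Two tokens with very different logit profiles still get exactly K = 2 experts.
confident_token = [5.0, 0.1, 0.0, -0.2]
ambiguous_token = [1.0, 0.9, 0.8, 0.7]
p_confident = topk_routing(confident_token, 2)
p_ambiguous = topk_routing(ambiguous_token, 2)
print(sum(w > 0 for w in p_confident), sum(w > 0 for w in p_ambiguous))
```

Both tokens activate two experts even though the first is dominated by a single expert, which is the rigidity that a token-dependent sparsity parameter is meant to remove.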

Paper Structure

This paper contains 29 sections, 3 theorems, 25 equations, 7 figures, 6 tables.

Key Result

Proposition 1

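Let ${\bm{u}} \in \mathbb{R}^E$ be the expert scores associated with token ${\bm{x}}$, given by the linear projection (Eq. eq:linear_proj), and let ${\bm{u}}_{(1)}\ge \cdots\ge {\bm{u}}_{(E)}$ be the sorted coordinates of ${\bm{u}}$. Define the cumulative sums $U_k = \sum_{i=1}^k {\bm{u}}_{(i)}$ for $k=1,\dots,E$. Then, for $\lambda < 1$, the Sparsegen routing probabilities have the closed form

$${\bm{p}}_i = \left[\frac{{\bm{u}}_i - \tau}{1-\lambda}\right]_+, \qquad i = 1,\dots,E,$$

where $[x]_+ = \max(x,0)$, and the threshold $\tau$ is determined as

$$\tau = \frac{U_{k^\star} - 1 + \lambda}{k^\star}, \qquad k^\star = \max\{k \in \{1,\dots,E\} : 1 - \lambda + k\,{\bm{u}}_{(k)} > U_k\},$$

such that ${\bm{p}}$ lies on the probability simplex.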
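
A minimal pure-Python sketch of the closed-form routing in Proposition 1 follows. It assumes the standard sparsegen-lin solution, $p_i = [(u_i - \tau)/(1-\lambda)]_+$ with $\tau = (U_k - 1 + \lambda)/k$ and $k$ the largest index satisfying $1 - \lambda + k\,u_{(k)} > U_k$; the function name `sparsegen` and the example scores are illustrative, not taken from the paper's code.

```python
def sparsegen(u, lam):
    """Project scores u onto the probability simplex with sparsity parameter lam < 1."""
    if lam >= 1:
        raise ValueError("lam must be strictly less than 1")
    srt = sorted(u, reverse=True)            # u_(1) >= ... >= u_(E)
    cumsum, k, U_k = 0.0, 0, 0.0
    for i, v in enumerate(srt, start=1):
        cumsum += v                          # running cumulative sum U_i
        if 1 - lam + i * v > cumsum:         # support condition for index i
            k, U_k = i, cumsum               # keep the largest index that passes
    tau = (U_k - 1 + lam) / k                # threshold; k >= 1 whenever lam < 1
    return [max((v - tau) / (1 - lam), 0.0) for v in u]

scores = [2.0, 1.0, 0.1]
p_sparse = sparsegen(scores, 0.0)    # lam = 0 coincides with sparsemax
p_denser = sparsegen(scores, -1.0)   # smaller lam spreads mass over more experts
p_dense = sparsegen(scores, -4.0)
print(p_sparse, p_denser, p_dense)
```

Every step (sort, cumulative sum, threshold, clamp) is a closed-form operation, so the output is differentiable almost everywhere in both the scores and $\lambda$, unlike a hard TopK index selection.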

Figures (7)

  • Figure 1: Overview of the LD-MoLE architecture, which enables Learnable Dynamic Routing (details in Section 3 and Fig. 2(c)) for LoRA adapters within the Mixture-of-Experts paradigm.
  • Figure 2: (a) Standard TopK routing activates a fixed number ($K$) of experts using non-differentiable selection. (b) Sparsegen routing introduces a differentiable projection onto the probability simplex, controlled by a sparsity parameter $\lambda$, which enables adaptive expert selection. (c) In the Sparsegen routing module, a lightweight shared MLP predicts the sparsity factor $\lambda$ for each token. Together with the logits $\mathbf{u}$, $\lambda$ determines the routing distribution $\mathbf{p}$ on the probability simplex over LoRA experts, enabling dynamic, token-dependent expert allocation across layers. The detailed mathematical formulation is provided in the section on Sparsegen routing.
  • Figure 3: Layer-wise $\lambda$ values for K, gate, and down projections.
  • Figure 4: Average number of LoRA experts selected per token across layers.
  • Figure 5: Correlation between the frequency of the top 200 most common tokens and their average number of activated experts. Each scatter point represents the average number of experts activated for a given token.
  • ...and 2 more figures
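
The routing module described in the Figure 2(c) caption can be sketched for a single token as below. Several details are assumptions made for illustration and are not confirmed by the text above: the one-hidden-layer MLP, the `1 - softplus` squashing that keeps $\lambda < 1$, the frozen base weight `W0` plus weighted LoRA updates $B_i A_i x$ (with the usual LoRA scaling factor omitted), the sparsegen-lin closed form, and every function and variable name.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sparsegen(u, lam):
    """Closed-form Sparsegen projection (assumed sparsegen-lin form), lam < 1."""
    cumsum, k, U_k = 0.0, 0, 0.0
    for i, v in enumerate(sorted(u, reverse=True), start=1):
        cumsum += v
        if 1 - lam + i * v > cumsum:
            k, U_k = i, cumsum
    tau = (U_k - 1 + lam) / k
    return [max((v - tau) / (1 - lam), 0.0) for v in u]

def predict_lambda(x, W1, W2):
    """Hypothetical shared MLP: ReLU hidden layer, output squashed below 1."""
    hidden = [max(v, 0.0) for v in matvec(W1, x)]
    raw = matvec(W2, hidden)[0]
    # 1 - softplus(raw) is always < 1; the small margin guards against underflow.
    return 1.0 - math.log1p(math.exp(raw)) - 1e-6

def routed_lora_forward(x, W0, W_router, mlp, experts):
    """Frozen base projection plus a Sparsegen-weighted sum of LoRA experts."""
    u = matvec(W_router, x)                  # expert logits from a linear projection
    lam = predict_lambda(x, *mlp)            # token-specific sparsity parameter
    p = sparsegen(u, lam)                    # routing weights on the simplex
    y = matvec(W0, x)
    for p_i, (A, B) in zip(p, experts):
        if p_i > 0.0:                        # experts with zero weight are skipped
            delta = matvec(B, matvec(A, x))  # low-rank update B (A x)
            y = [y_j + p_i * d_j for y_j, d_j in zip(y, delta)]
    return y, p, lam

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, r, E, hidden_dim = 4, 2, 3, 8             # toy sizes, chosen arbitrarily
x = [random.gauss(0, 1) for _ in range(d)]
W0, W_router = rand_matrix(d, d), rand_matrix(E, d)
mlp = (rand_matrix(hidden_dim, d), rand_matrix(1, hidden_dim))
experts = [(rand_matrix(r, d), rand_matrix(d, r)) for _ in range(E)]
y, p, lam = routed_lora_forward(x, W0, W_router, mlp, experts)
print(f"lambda={lam:.3f}, active experts={sum(v > 0 for v in p)}")
```

Because the MLP sees the token representation, two tokens passing through the same layer can receive different values of $\lambda$ and therefore different numbers of active experts.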

Theorems & Definitions (7)

  • Proposition 1: Closed-form Sparsegen routing (Proposition 0.1 in the Sparsegen paper)
  • proof
  • Lemma 1: Sparsegen selects at least one expert.
  • Proposition 2: $k$ expert activation
  • proof
  • proof
  • proof
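
The statements of Lemma 1 and Proposition 2 are not reproduced in this summary, so the following is only a numerical illustration of the behaviour their titles describe. It assumes the standard sparsegen-lin support rule, under which the number of active experts is the largest $k$ with $1 - \lambda + k\,u_{(k)} > U_k$; the helper name `num_active_experts` and the example scores are hypothetical.

```python
def num_active_experts(u, lam):
    """Support size of the Sparsegen output for scores u and lam < 1."""
    cumsum, k = 0.0, 0
    for i, v in enumerate(sorted(u, reverse=True), start=1):
        cumsum += v                      # cumulative sum U_i of sorted scores
        if 1 - lam + i * v > cumsum:     # index i belongs to the support
            k = i
    return k

scores = [2.0, 1.0, 0.1, -0.5]
lams = [-4.0, -2.0, -1.0, 0.0, 0.9]
counts = [num_active_experts(scores, lam) for lam in lams]
print(dict(zip(lams, counts)))  # {-4.0: 4, -2.0: 3, -1.0: 2, 0.0: 1, 0.9: 1}
```

For $k = 1$ the condition reduces to $1 - \lambda > 0$, so at least one expert is always selected whenever $\lambda < 1$, in line with the title of Lemma 1. Lowering $\lambda$ admits more experts into the support, which is the quantity a sparsity objective on $\lambda$ can act on.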