Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning

Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

TL;DR

The paper addresses inefficiencies in MoE-based multi-task learning that arise when models are adapted from single-task pretrained backbones, where the STL-to-MTL transition can cause redundant adaptation and gradient conflicts. It introduces Adaptive Shared Experts (ASE) within a LoRA-MoE framework: shared experts receive router-computed gating weights that are normalized jointly with those of the sparse experts, giving a dynamic balance that favors shared expertise early in training and gradually shifts toward task-specific specialization. It also adds fine-grained LoRA experts (more experts, each with proportionally lower rank) to improve cooperation under a comparable parameter budget. On PASCAL-Context, the reported gains include Seg mIoU up to 74.0 and an average $\Delta_m$ of about +7.58%, with a parameter overhead of roughly 4%.

Abstract

Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates the STL to MTL transition and enhances expert specialization and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validate the effectiveness of fine-grained designs for MTL.
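
The two ideas in the abstract, joint normalization of shared and sparse gates and the fine-grained trade of rank for expert count, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that top-$k$ selection applies only to the sparse experts, that shared experts are always active, and that normalization is a softmax over the active logits. The names `joint_gate` and `lora_params`, and the layer sizes used below, are hypothetical.

```python
import math


def joint_gate(sparse_logits, shared_logits, k):
    """Gate weights for the top-k sparse experts and all shared experts.

    Assumption (not confirmed by the source): one softmax is taken over the
    logits of the selected sparse experts and the shared experts together,
    so the returned weights sum to 1.
    """
    if not 0 < k <= len(sparse_logits):
        raise ValueError("k must be between 1 and the number of sparse experts")
    # Indices of the k largest sparse logits.
    top = sorted(range(len(sparse_logits)),
                 key=lambda i: sparse_logits[i], reverse=True)[:k]
    active = [sparse_logits[i] for i in top] + list(shared_logits)
    # Numerically stable softmax over the active logits only.
    m = max(active)
    exps = [math.exp(v - m) for v in active]
    z = sum(exps)
    sparse_gates = {i: e / z for i, e in zip(top, exps[:k])}
    shared_gates = [e / z for e in exps[k:]]
    return sparse_gates, shared_gates


def lora_params(num_experts, rank, d_in, d_out):
    """Parameter count of num_experts LoRA pairs A (rank x d_in), B (d_out x rank)."""
    return num_experts * rank * (d_in + d_out)


# Hypothetical router logits: four sparse experts, one shared expert, top-2.
sparse_gates, shared_gates = joint_gate([2.0, 0.5, 1.0, -1.0], [0.0], k=2)
total = sum(sparse_gates.values()) + sum(shared_gates)

# Hypothetical layer sizes: doubling the experts while halving the rank
# leaves the LoRA parameter count unchanged.
coarse = lora_params(num_experts=16, rank=4, d_in=768, d_out=3072)
fine = lora_params(num_experts=32, rank=2, d_in=768, d_out=3072)
```

Because the shared expert's weight comes from the same softmax as the sparse experts' weights, a larger shared logit automatically shrinks the sparse gates, and vice versa. This is one way the balance between shared and task-specific experts could shift during training.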

Paper Structure

This paper contains 9 sections, 11 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An illustration of the proposed adaptive shared experts. The frozen FFN $\bm{\mathrm{W}}_0$ is combined with sparse experts $E_i(\cdot;\,\bm{\mathrm{A}}_i,\bm{\mathrm{B}}_i)$ and adaptive shared experts $E_i^s(\cdot;\,\bm{\mathrm{A}}^{s}_i,\bm{\mathrm{B}}^{s}_i)$, whose contributions are assigned adaptive gating weights computed by task-specific routers and normalized jointly with sparse experts.
  • Figure 2: Comparison between the baseline, a naive shared expert, and the proposed ASE. BCE is used as the evaluation metric for Edge. $\Delta_m$ is computed with respect to a ViT-based STL model.
  • Figure 3: ASE under different top-$k$ and fine-grained settings. We adopt BCE for Edge and compute $\Delta_m$ relative to a ViT-based STL model. (a) Line plot of $\Delta_m$ versus top-$k$ with $S{=}1$. The x-axis shows the chosen top-$k$ and the y-axis shows $\Delta_m$. (b) Bar chart of $\Delta_m$ under fine-grained allocations.
  • Figure 4: Activation frequency of experts across layers. We visualize the activation frequency of experts for the Norm. task across all layers at epochs 1, 3, and 40 (out of 40). The y-axis represents the layer index, and the x-axis represents the 16 experts, including one ASE. The color bar indicates the normalized frequency.
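
The captions report $\Delta_m$ relative to a ViT-based STL model without defining it here. The sketch below assumes the conventional multi-task definition: the mean relative change over tasks against the single-task baseline, with the sign flipped for metrics where lower is better. The function name `delta_m` and all numbers are made up for illustration and are not results from the paper.

```python
def delta_m(mtl, stl, lower_is_better):
    """Average relative improvement (in percent) of an MTL model over STL baselines.

    Assumes the conventional definition: for each task, (M_mtl - M_stl) / M_stl,
    negated when a lower metric value is better, then averaged over tasks.
    """
    terms = []
    for task, base in stl.items():
        sign = -1.0 if lower_is_better[task] else 1.0
        terms.append(sign * (mtl[task] - base) / base)
    return 100.0 * sum(terms) / len(terms)


# Made-up metrics: segmentation mIoU (higher is better) and
# surface-normal error (lower is better).
stl_scores = {"seg": 60.0, "norm": 20.0}
mtl_scores = {"seg": 66.0, "norm": 18.0}
direction = {"seg": False, "norm": True}

score = delta_m(mtl_scores, stl_scores, direction)
```

Under this definition a positive $\Delta_m$ means the multi-task model improves on the single-task baselines on average, which matches the way the figures use the metric.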