LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning

Md Kowsher, Haris Mansoor, Nusrat Jahan Prottasha, Ozlem Garibay, Victor Zhu, Zhengping Ji, Chen Chen

Abstract

MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require a separate adapter per expert, which causes trainable parameters to scale linearly with the number of experts and limits applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations, eliminating the learned router parameters typically required at each layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and training up to 29% faster than corresponding MoE-PEFT baselines.
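
A minimal PyTorch sketch of the mechanism described above: a single shared LoRA adapter whose output is modulated by lightweight per-expert vectors, with routing weights reused from $E$-dimensional slices of the frozen and adapted representations and Auto Top-K selection. The class name `LiMELinear`, the mixing coefficient `gamma_r`, the threshold `theta`, and the exact way the modulated update is added back are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch. Names (LiMELinear, gamma_r, theta) and the exact
# combination rules are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LiMELinear(nn.Module):
    """Frozen linear layer + one shared LoRA adapter + E lightweight expert vectors."""

    def __init__(self, d_in, d_out, rank=8, num_experts=4, gamma_r=0.7, theta=0.5):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                        # frozen backbone
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)    # shared PEFT module
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
        self.experts = nn.Parameter(torch.ones(num_experts, d_out))   # modulation vectors p_i
        self.num_experts, self.gamma_r, self.theta = num_experts, gamma_r, theta

    def forward(self, x):                                  # x: (batch, d_in)
        z = self.base(x)                                   # frozen output
        delta = F.linear(F.linear(x, self.lora_A), self.lora_B)   # shared adapter output
        z_hat = z + delta                                  # PEFT-adapted output
        # Zero-parameter routing: reuse E-dimensional slices of both representations.
        logits = self.gamma_r * z_hat[:, : self.num_experts] \
               + (1 - self.gamma_r) * z[:, : self.num_experts]
        w = torch.softmax(logits, dim=-1)                  # routing weights, shape (batch, E)
        # Auto Top-K: keep experts with w_i >= theta * max_j w_j, then renormalize.
        keep = w >= self.theta * w.max(dim=-1, keepdim=True).values
        w = torch.where(keep, w, torch.zeros_like(w))
        w = w / w.sum(dim=-1, keepdim=True)
        # Expert specialization via modulation of the shared adapter output.
        mod = w @ self.experts                             # (batch, d_out)
        return z + mod * delta


layer = LiMELinear(d_in=16, d_out=32, rank=4, num_experts=4)
print(layer(torch.randn(2, 16)).shape)                     # torch.Size([2, 32])
```

Because the adapter is shared, adding an expert adds only one $d_{\text{o}}$-dimensional vector rather than a full adapter copy.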

Paper Structure

This paper contains 55 sections, 3 theorems, 87 equations, 12 figures, and 21 tables.

Key Result

Theorem 1.1

Let $X:\Omega\to\mathbb{R}^d$ and $Y:\Omega\to\mathcal{Y}$ with $Y$ discrete and $H(Y)<\infty$. Fix $n \geq 2$. Let $r_{n-1}:\mathbb{R}^d\to\{1,\dots,n-1\}$ and $r_n:\mathbb{R}^d\to\{1,\dots,n\}$ be measurable routers, and let $Z_{n-1} = A^{(n-1)}_{r_{n-1}(X)}X$ and $Z_n = A^{(n)}_{r_n(X)}X$ denote the routed outputs, where $A^{(n-1)}_j, A^{(n)}_e \in\mathbb{R}^{d\times d}$ are fixed linear maps. Assume the routing induced by $r_n$ refines the routing induced by $r_{n-1}$ (full conditions in the paper). Then $I(Z_n;Y) \ge I(Z_{n-1};Y)$: moving from $n-1$ to $n$ experts cannot reduce the label information retained by the routed representation.
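
As a hedged illustration of why a result of this form holds (not the paper's proof): if the refinement assumptions imply that the coarser routed output is a measurable function of the finer one, say $Z_{n-1} = g(Z_n)$, the inequality is an instance of the data-processing inequality.

```latex
% Hedged sketch only, not the paper's proof. Assumption (ours, for illustration):
% Z_{n-1} = g(Z_n) for some measurable g, so Y -> Z_n -> g(Z_n) is a Markov chain.
\begin{align*}
  I(Z_{n-1}; Y) \;=\; I\bigl(g(Z_n); Y\bigr) \;\le\; I(Z_n; Y)
  \qquad \text{(data-processing inequality)}.
\end{align*}
```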

Figures (12)

  • Figure 1: LiME is compatible with any PEFT method; we use LoRA only as an example. (a) MoE-LoRA replicates LoRA adapters ($A_i,B_i$) for each expert and uses a learned router, requiring $E\times|\phi|$ adapter parameters plus ${d_{\text{i}}}\times E$ router parameters. (b) LiME shares a single PEFT module (LoRA here) and uses lightweight expert modulators $\textcolor{trainpurple}{\mathbf{p}}_i\in\mathbb{R}^{d_{\text{o}}}$, reducing trainable MoE parameters to $|\phi|+E {d_{\text{o}}}$ (a worked parameter count is sketched after this figure list). Router reuse: routing is computed directly from representations already produced in the forward pass (an $E$-dimensional slice of the frozen output $z_{1:E}$ and of the PEFT-modified output $\hat{z}_{1:E}$), so no separate router weights are introduced (dashed router). This PEFT block can be replaced by other PEFT strategies (e.g., DoRA, Prompt Tuning, SliceFine) without changing LiME. (c) N-gram routing shares one routing decision within each window (e.g., $n{=}3$), and Auto Top-K selects experts with $w_i \ge \theta \times \max_j w_j$. (d) Load balancing losses prevent expert collapse and encourage more uniform utilization (Details in §\ref{sec:method}).
  • Figure 2: Efficiency comparison of LiME vs. MoE-PEFT baselines. (a) LiME variants (stars) achieve higher throughput and shorter training time; LiME-PromptTuning is the most efficient (4.52 samples/s, 25 min). (b) All methods show comparable peak memory due to the dominant frozen backbone. (c) LiME requires 0.02--0.57M trainable parameters, up to $4\times$ fewer than corresponding MoE-PEFT methods. (d) Total model size remains comparable ($\sim$894M) across all methods.
  • Figure 3: Empirical validation of our theory. (a--b) Linear probe accuracy at different token positions within an n-gram window (layers 24 and 22), supporting Theorem \ref{thm:ngram_routing}. (c--d) GLUE accuracy versus number of experts for LiME and MoE-LoRA, supporting Theorem \ref{thm:mi}; stars mark the best $E$ for each method.
  • Figure 4: Routing ablations. (a) Feature selection for routing is robust. (b) Zero-parameter routing matches learned routing performance. (c--d) Routing balance $\gamma_r \in [0.6, 0.8]$ yields optimal performance by combining frozen and adapted signals.
  • Figure 5: (a) Auto Top-K outperforms fixed Top-K. (b--c) Moderate load balancing prevents collapse while preserving specialization; over-balancing hurts accuracy. (d) Optimal expert count is $E \in [4, 6]$; beyond this, insufficient data limits further gains.
  • ...and 7 more figures
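
To make the parameter formulas in the Figure 1 caption concrete (as referenced there), here is a small arithmetic check for one linear layer. The layer width ($d_{\text{i}} = d_{\text{o}} = 768$), LoRA rank $r = 8$, and $E = 4$ experts are illustrative assumptions, not the paper's configuration.

```python
# Illustrative parameter count for a single linear layer, following the formulas
# in the Figure 1 caption. All sizes below are assumed for illustration only.
d_in = d_out = 768      # layer width (assumed)
r, E = 8, 4             # LoRA rank and number of experts (assumed)

phi = r * (d_in + d_out)            # one LoRA adapter: A (r x d_in) + B (d_out x r)

moe_lora = E * phi + d_in * E       # E adapter copies + learned router weights
lime     = phi + E * d_out          # one shared adapter + E modulation vectors

print(f"MoE-LoRA: {moe_lora:,} trainable MoE parameters")  # 52,224
print(f"LiME:     {lime:,} trainable MoE parameters")      # 15,360
print(f"ratio:    {moe_lora / lime:.1f}x")                 # 3.4x
```

With these assumed sizes, the shared-adapter design needs roughly 3.4$\times$ fewer trainable MoE parameters per layer, in the same spirit as the up-to-$4\times$ savings reported in Figure 2(c).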

Theorems & Definitions (8)

  • Theorem 1.1: More experts cannot reduce label information under refinement
  • proof
  • Theorem 2.1: Smoothed quantitative equivalence: expert-specific PEFT vs. LiME
  • proof
  • Definition 3.1: Mutual Information
  • Definition 3.2: Conditional Mutual Information
  • Theorem 3.3: Routing Informativeness in Causal N-gram Windows
  • proof
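
Figure 1(c) and Theorem 3.3 concern n-gram windowed routing, in which one routing decision is shared by all tokens in a window of $n$ tokens. Below is a hedged sketch of that idea only; the window size, the use of the first token of each window as the routing signal, and the name `ngram_route` are assumptions, not the paper's implementation.

```python
# Hedged sketch of n-gram windowed routing: compute one routing decision per
# window of n tokens and share it across every token in that window.
import torch
import torch.nn.functional as F

def ngram_route(logits, n=3):
    """logits: (batch, seq_len, E) per-token routing logits.
    Returns (batch, seq_len, E) weights that are constant within each n-token window."""
    b, t, e = logits.shape
    pad = (-t) % n                                # pad so seq_len splits into full windows
    padded = F.pad(logits, (0, 0, 0, pad))        # pad along the sequence dimension
    first = padded.view(b, -1, n, e)[:, :, 0, :]  # routing signal: first token of each window
    w = torch.softmax(first, dim=-1)              # one expert-weight vector per window
    return w.repeat_interleave(n, dim=1)[:, :t]   # broadcast the decision back to tokens

w = ngram_route(torch.randn(2, 7, 4), n=3)
print(w.shape)  # torch.Size([2, 7, 4])
```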