FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji

TL;DR

FlyLoRA tackles parameter interference and inefficiency in MoE-based LoRA by introducing an implicit, rank-wise MoE where a fixed sparse random projection acts as the router. By activating only the top-k rank-1 components after projection, FlyLoRA achieves intra-task decoupling without explicit routing parameters, and theoretical results show distance preservation and reduced gradient covariance. It also enables training-free multi-task model merging via approximate orthogonality between independent random projections, mitigating inter-task interference. Empirically, FlyLoRA improves accuracy across knowledge, science, math, and code tasks with lower activated parameter counts and demonstrates strong robustness in single-task and multi-task settings. The work blends neuroscience-inspired design with PEFT, offering a scalable, efficient approach to decoupled task learning and merging.
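
The routing idea described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the toy sizes, and the random values in `B` are assumptions made here, while the sampling rule for `A` (exactly `p` non-zero entries per row drawn from $\mathcal{N}(0, \frac{1}{r^2})$) follows the statement of Theorem 3.1.

```python
import random


def make_sparse_projection(r, n, p, rng):
    """Frozen A (r x n): every row gets exactly p non-zero entries from N(0, 1/r^2)."""
    A = [[0.0] * n for _ in range(r)]
    for row in A:
        for j in rng.sample(range(n), p):
            row[j] = rng.gauss(0.0, 1.0 / r)  # std 1/r, i.e. variance 1/r^2
    return A


def flylora_delta(x, A, B, k):
    """Compute the adapter output using only the k ranks with the largest |A x|."""
    # The down-projection doubles as the routing score, so no router weights exist.
    h = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    active = sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)[:k]
    delta = [sum(b_row[i] * h[i] for i in active) for b_row in B]
    return delta, sorted(active)


rng = random.Random(0)
n, m, r, p, k = 16, 12, 8, 4, 2  # toy sizes chosen for illustration only
A = make_sparse_projection(r, n, p, rng)
# B (m x r) is the trainable factor; random values here only make the output non-trivial.
B = [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

delta, active = flylora_delta(x, A, B, k)
print("active ranks:", active)
print("output dimension:", len(delta))
```

Because `A` is never updated, the same projection both compresses the input and decides which columns of `B` receive a contribution, which is why no separate routing parameters appear in the sketch.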

Abstract

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
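
The orthogonality property mentioned above can be checked numerically. The sketch below samples two independent sparse projections, one per hypothetical task, and measures how close their rows are to orthogonal. The sizes, the function names, and the choice of mean absolute cosine as the metric are assumptions made for illustration; the paper's own orthogonality result may be stated differently.

```python
import math
import random


def sparse_rows(r, n, p, rng):
    """r rows of length n, each with exactly p non-zero N(0, 1/r^2) entries."""
    rows = []
    for _ in range(r):
        row = [0.0] * n
        for j in rng.sample(range(n), p):
            row[j] = rng.gauss(0.0, 1.0 / r)
        rows.append(row)
    return rows


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


rng = random.Random(0)
r, n, p = 8, 2048, 64  # illustrative sizes, not taken from the paper
A_task1 = sparse_rows(r, n, p, rng)  # frozen projection for a first task
A_task2 = sparse_rows(r, n, p, rng)  # independently sampled for a second task

# Compare every row of one projection with every row of the other.
cross = [abs(cosine(u, v)) for u in A_task1 for v in A_task2]
mean_cross = sum(cross) / len(cross)
print(f"mean |cos| between rows of independent projections: {mean_cross:.4f}")
```

When the rows of the two projections are nearly orthogonal, the adapters built on them read from almost disjoint input directions, which is the intuition behind adding them together without retraining.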

Paper Structure

This paper contains 45 sections, 9 theorems, 29 equations, 4 figures, 20 tables.

Key Result

Theorem 3.1

Given the matrix $\bm{A}\in\mathbb{R}^{r\times n}$ with each row having exactly $p$ non-zero entries randomly sampled from $\mathcal{N}(0, \frac{1}{r^2})$, for any $\epsilon>0$, for any input embeddings $\bm{x},\bm{y}\in\mathbb{R}^n$, where $\sigma^2=\frac{p}{nr^2}$. A detailed proof is provided in the appendix.
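
As a sanity check on the scaling constant, note that under the sampling rule above each row of $\bm{A}$ contributes $\frac{p}{nr^2}\|\bm{x}-\bm{y}\|^2$ to the expected squared projected distance, so $\mathbb{E}\|\bm{A}(\bm{x}-\bm{y})\|^2 = r\sigma^2\|\bm{x}-\bm{y}\|^2$. This identity is derived here from the definitions and is not the theorem's concentration bound. The Monte Carlo sketch below, with sizes and names chosen purely for illustration, compares the empirical mean against that prediction.

```python
import random


def sparse_projection(r, n, p, rng):
    """A (r x n) with exactly p non-zero N(0, 1/r^2) entries per row."""
    A = []
    for _ in range(r):
        row = [0.0] * n
        for j in rng.sample(range(n), p):
            row[j] = rng.gauss(0.0, 1.0 / r)
        A.append(row)
    return A


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(a * a for a in v)


rng = random.Random(0)
n, r, p, trials = 32, 8, 8, 4000  # illustrative sizes
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
z = [a - b for a, b in zip(x, y)]

# Average ||A (x - y)||^2 over many independent draws of A.
total = 0.0
for _ in range(trials):
    A = sparse_projection(r, n, p, rng)
    total += sq_norm([sum(a * zi for a, zi in zip(row, z)) for row in A])
empirical = total / trials

sigma2 = p / (n * r * r)             # sigma^2 as defined in the theorem
predicted = r * sigma2 * sq_norm(z)  # expected squared distance after projection
ratio = empirical / predicted
print(f"empirical / predicted = {ratio:.3f}")
```

A ratio close to 1 indicates that, on average, the sparse projection rescales squared distances by the fixed factor $r\sigma^2$ rather than distorting them.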

Figures (4)

  • Figure 1: (a) Accuracy comparison under a fixed total rank $r=32$ and activation rank $k=8$. Finer-grained rank allocation (from $4$ experts $\times~8$ rank to $32$ experts $\times~1$ rank) yields consistent performance gains. (b) Activated trainable parameters (relative to Full FT) under the same budget. Increasing expert granularity leads to a monotonic rise in activated parameters due to router overhead. (c) Schematic of the fly olfactory circuit. Odor signals in projection neurons (PNs) are randomly projected to Kenyon cells (KCs), with each KC connecting to a fixed number of PNs (but not all), forming sparse connections. These signals are then selectively projected to mushroom body output neurons (MBONs), while lateral inhibition from an anterior paired lateral (APL) neuron suppresses weak KC-MBON connections, implementing a winner-take-all strategy. Thus, the number of activated KCs is much smaller than the total dimension of the KC layer.
  • Figure 2: Schematic illustrations of different LoRA variants. (a) LoRA employs low-rank matrices $\bm{A}$ and $\bm{B}$ to simulate weight updates, where each row of $\bm{A}$ is fully connected to the corresponding column of $\bm{B}$. (b) MoE-based LoRA decomposes the updates into multiple small experts $\{\bm{A}_i, \bm{B}_i\}_{i=1}^N$ and uses a router to determine which experts should be activated. (c) FlyLoRA unifies the down-projection and router into a frozen matrix $\bm{A}$ and selectively activates only the ranks in $\bm{B}$ linked to the top-$k$ magnitude activations after projection through $\bm{A}$.
  • Figure 3: (a) Activation value magnitude distribution across dimensions, showing the mean activation strength at different top-$k$ selection percentages. (b-c) Gradient correlation matrices of (b) LoRA-FA$_{(r=32)}$ versus (c) FlyLoRA$_{(k=8)}$'s $\bm{B}$ matrices ($10$ randomly sampled columns). For a simplified illustration, we use the LoRA module of q_proj in the middle layer of Llama-3.1-8B on MMLU.
  • Figure 4: Accuracy comparison for: (a) Sparsity ratio in $\bm{A}$, (b) Activated rank (with fixed total rank $r=32$), (c) Total rank (with fixed activated rank $k=8$).
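
The router overhead described in Figure 1(b) can be made concrete with a rough parameter count. The sketch below assumes a single linear router of size $n \times N$ for an MoE-based LoRA with $N$ experts, a hypothetical $4096 \times 4096$ layer, and a counting convention in which frozen weights are excluded. None of these details are stated in this summary, so the numbers are illustrative only.

```python
def moe_lora_activated(n, m, k, num_experts):
    """Activated trainable parameters for one n -> m layer with an explicit linear router."""
    adapter = k * (n + m)      # k active ranks: one row of A and one column of B each
    router = n * num_experts   # assumed router: a single n x num_experts linear map
    return adapter + router


def flylora_activated(m, k):
    """A is frozen and doubles as the router, so only k columns of B are counted."""
    return k * m


n = m = 4096  # hypothetical layer size
r, k = 32, 8  # total rank and activated rank, as in Figure 1
for num_experts in (4, 8, 16, 32):
    rank_per_expert = r // num_experts
    count = moe_lora_activated(n, m, k, num_experts)
    print(f"{num_experts} experts x rank {rank_per_expert}: {count}")
print("FlyLoRA:", flylora_activated(m, k))
```

Under these assumptions the adapter term stays constant while the router term grows linearly with the number of experts, which matches the monotonic rise the caption describes.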

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 3.3: Covariance Reduction Under top-$k$
  • Theorem 3.4: Approximate Subspace Orthogonality
  • Corollary 3.5
  • Theorem A.1
  • Theorem A.2
  • Theorem A.3: Covariance Reduction Under top-$k$
  • Theorem A.4: Approximate Subspace Orthogonality
  • Corollary A.5