Two Is Better Than One: Rotations Scale LoRAs
Hongcan Guo, Guoshun Nan, Yuan Yang, Diyang Zhang, Haotian Li, Zhican Chen, Qinchuan Zhou, Yuhan Ran, Xinye Cao, Sicong Leng, Xiaofeng Tao, Xudong Jiang
TL;DR
RadarGate tackles the expressiveness limits of gating in LoRA-based MoEs by introducing rotational interactions over LoRA representations. The RotationGate applies input-dependent rotations to LoRA outputs and the StretchGate scales their magnitudes, expanding both the hypothesis and output spaces beyond the fixed convex cone $\mathcal{H} = \text{conv}(\bigl\{v_i\bigr\})$ to dynamic cones $\mathcal{H}'(x)$. Theoretically, the method enlarges the feasible function class so that $\mathcal{K}_{gate} \subseteq \mathcal{K}_{ours}$ and $\bigcup_x \mathcal{H}'(x) \supseteq \mathcal{H}$, enabling better fitting of complex targets and improved generalization as the number of LoRAs grows. Empirically, RadarGate yields strong improvements across 21 tasks on six benchmarks, with stable convergence and favorable scaling in both module count and model size, while incurring minimal extra computational cost due to low-rank factorization. These results suggest a practical path to more scalable, parameter-efficient LoRA adaptations in LLMs and hint at strong potential for multimodal extensions.
Abstract
Scaling Low-Rank Adaptation (LoRA)-based Mixture-of-Experts (MoE) enables large language models (LLMs) to adapt efficiently to diverse tasks. However, traditional gating mechanisms that route inputs to the best experts may fundamentally hinder LLMs' scalability, leading to poor generalization and underfitting. We identify the root cause as the restricted expressiveness of existing weighted-sum mechanisms, both within and outside the convex cone of LoRA representations. This motivates us to propose RadarGate, a novel geometrically inspired gating method that introduces rotational operations on LoRA representations to boost expressiveness and facilitate richer feature interactions among multiple LoRAs for scalable LLMs. Specifically, we first fuse each LoRA representation with the other LoRAs using a learnable component and then feed the output to a rotation matrix. This matrix involves learnable parameters that define the relative angular relationship between LoRA representations. This simple yet effective mechanism provides an extra degree of freedom, facilitating the learning of cross-LoRA synergies and properly tackling the poor generalization and underfitting that arise as the number of LoRAs grows. Extensive experiments on 6 public benchmarks across 21 tasks show the effectiveness of RadarGate for scaling LoRAs. We also provide insights revealing that the rotations applied to each pair of representations are contrastive, encouraging closer alignment of semantically similar representations during geometrical transformation while pushing distant ones further apart. We will release our code to the community.
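To make the geometric argument concrete, the following is a minimal NumPy sketch and not the authors' implementation. It contrasts a plain softmax-weighted sum of LoRA outputs with a gate that first rotates and rescales each output using input-dependent parameters. The names `W_gate`, `W_angle`, and `W_scale`, the pairwise block-diagonal rotation, and the exponential scaling are all assumptions made for illustration. The cross-LoRA fusion step described in the abstract is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rotate_pairs(v, theta):
    """Rotate consecutive coordinate pairs of v by angle theta.

    Equivalent to a block-diagonal matrix of 2x2 rotations (assumes even length).
    """
    x, y = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = np.cos(theta) * x - np.sin(theta) * y
    out[1::2] = np.sin(theta) * x + np.cos(theta) * y
    return out

def weighted_sum_gate(x, lora_outputs, W_gate):
    """Baseline gate: non-negative weights, so the output stays in the
    convex hull (and hence the convex cone) of the LoRA outputs."""
    g = softmax(W_gate @ x)              # shape (n,)
    return g @ lora_outputs              # shape (d,)

def rotate_and_stretch_gate(x, lora_outputs, W_gate, W_angle, W_scale):
    """Hypothetical rotation + stretch gate: each LoRA output is rotated by
    an input-dependent angle and rescaled before the weighted sum."""
    g = softmax(W_gate @ x)              # mixing weights, shape (n,)
    theta = W_angle @ x                  # one angle per LoRA, shape (n,)
    scale = np.exp(W_scale @ x)          # positive stretch per LoRA, shape (n,)
    transformed = np.stack(
        [s * rotate_pairs(v, t) for v, t, s in zip(lora_outputs, theta, scale)]
    )
    return g @ transformed

# Toy setup: two LoRA outputs spanning the first quadrant of the plane.
x = np.array([1.0])
lora_outputs = np.array([[1.0, 0.0], [0.0, 1.0]])
W_gate = np.zeros((2, 1))                        # equal weights of 0.5
W_angle = np.full((2, 1), np.pi / 2)             # rotate both by 90 degrees
W_scale = np.zeros((2, 1))                       # no stretching (scale = 1)

baseline = weighted_sum_gate(x, lora_outputs, W_gate)
rotated_out = rotate_and_stretch_gate(x, lora_outputs, W_gate, W_angle, W_scale)
print(baseline)      # approximately [0.5, 0.5], inside the original cone
print(rotated_out)   # approximately [-0.5, 0.5], outside the original cone
```

In this toy case the baseline output can never have a negative coordinate, whichever non-negative weights the gate picks. The rotated variant reaches a point outside the first quadrant, which is the kind of extra reach the dynamic cones $\mathcal{H}'(x)$ are meant to capture.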
