Two Is Better Than One: Rotations Scale LoRAs

Hongcan Guo, Guoshun Nan, Yuan Yang, Diyang Zhang, Haotian Li, Zhican Chen, Qinchuan Zhou, Yuhan Ran, Xinye Cao, Sicong Leng, Xiaofeng Tao, Xudong Jiang

TL;DR

RadarGate tackles the expressiveness limits of gating in LoRA-based MoEs by introducing rotational interactions over LoRA representations. The RotationGate applies input-dependent rotations to LoRA outputs and the StretchGate scales their magnitudes, expanding both the hypothesis and output spaces beyond the fixed convex cone $\mathcal{H} = \text{conv}(\bigl\{v_i\bigr\})$ to dynamic cones $\mathcal{H}'(x)$. Theoretically, the method enlarges the feasible function class so that $\mathcal{K}_{gate} \subseteq \mathcal{K}_{ours}$ and $\bigcup_x \mathcal{H}'(x) \supseteq \mathcal{H}$, enabling better fitting of complex targets and improved generalization as the number of LoRAs grows. Empirically, RadarGate yields strong improvements across 21 tasks on six benchmarks, with stable convergence and favorable scaling in both module count and model size, while incurring minimal extra computational cost due to low-rank factorization. These results suggest a practical path to more scalable, parameter-efficient LoRA adaptations in LLMs and hint at strong potential for multimodal extensions.

Abstract

Scaling Low-Rank Adaptation (LoRA)-based Mixture-of-Experts (MoE) enables large language models (LLMs) to adapt efficiently to diverse tasks. However, traditional gating mechanisms that route inputs to the best experts may fundamentally hinder LLMs' scalability, leading to poor generalization and underfitting issues. We identify that the root cause lies in the restricted expressiveness of existing weighted-sum mechanisms, both within and outside the convex cone of LoRA representations. This motivates us to propose RadarGate, a novel geometrically inspired gating method that introduces rotational operations on LoRA representations to boost expressiveness and facilitate richer feature interactions among multiple LoRAs for scalable LLMs. Specifically, we first fuse each LoRA representation with the other LoRAs using a learnable component and then feed the output to a rotation matrix. This matrix involves learnable parameters that define the relative angular relationship between LoRA representations. Such a simple yet effective mechanism provides an extra degree of freedom, facilitating the learning of cross-LoRA synergies and properly tackling the challenging poor generalization and underfitting issues as the number of LoRAs grows. Extensive experiments on 6 public benchmarks across 21 tasks show the effectiveness of our RadarGate for scaling LoRAs. We also provide valuable insights, revealing that the rotations applied to each pair of representations are contrastive, encouraging closer alignment of semantically similar representations during geometrical transformation while pushing distant ones farther apart. We will release our code to the community.
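
The geometric claim about weighted sums and the convex cone can be illustrated with a small numerical sketch. It is not the authors' implementation: the 2-D vectors, the hand-picked angles and scales, and the helper names `rotation_2d`, `weighted_sum_gate`, and `rotate_then_stretch` are all assumptions made for illustration, and nonnegative weights stand in for a softmax-style gate. In RadarGate the angles and scales would be produced by learnable, input-dependent components rather than fixed by hand.

```python
import numpy as np


def rotation_2d(theta):
    """Return the 2x2 counter-clockwise rotation matrix for angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def weighted_sum_gate(reps, weights):
    """Plain gating (illustrative): a nonnegative weighted sum of the rows of reps.

    The result always lies in the cone spanned by the rows of reps.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("gating weights are assumed nonnegative")
    return weights @ reps


def rotate_then_stretch(reps, angles, scales):
    """Rotate each representation by its own angle, then scale and sum.

    Hypothetical stand-in for a rotation step followed by a stretch step.
    """
    rotated = np.stack([rotation_2d(a) @ v for a, v in zip(angles, reps)])
    return np.asarray(scales, dtype=float) @ rotated


# Two toy "LoRA representations" spanning the first quadrant.
reps = np.array([[1.0, 0.0], [0.0, 1.0]])

# A target outside the cone: its first coordinate is negative.
target = np.array([-1.0, 1.0])

# Any nonnegative weighting stays inside the first quadrant.
plain = weighted_sum_gate(reps, [0.3, 0.7])

# Rotating both vectors by 90 degrees moves the cone so it contains the target.
out = rotate_then_stretch(reps, [np.pi / 2, np.pi / 2], [1.0, 1.0])

print("plain gate output:", plain)
print("rotated output matches target:", np.allclose(out, target))
```

With these toy vectors the plain gate can never produce a negative first coordinate, whereas the rotated and stretched combination reaches the target. This is the sense in which rotations add a degree of freedom beyond reweighting.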

Paper Structure

This paper contains 36 sections, 4 theorems, 56 equations, 16 figures, 7 tables.

Key Result

Lemma 1

For nested function hypothesis spaces $\mathcal{K}_1 \subseteq \mathcal{K}_2$, the optimal fitting error $\mathcal{E}_t = \inf_{f \in \mathcal{K}_t} L(f, g^*)$ of the target function $g^*$ under the loss function $L$ necessarily satisfies $\mathcal{E}_2 \leq \mathcal{E}_1$.
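
The lemma follows from a standard property of infima over nested sets. The short derivation below is a sketch of that argument, not necessarily the proof given in the paper. The final line applies it to the inclusion $\mathcal{K}_{gate} \subseteq \mathcal{K}_{ours}$ stated in the TL;DR, with $\mathcal{E}_{gate}$ and $\mathcal{E}_{ours}$ introduced here as shorthand for the corresponding optimal fitting errors.

```latex
\begin{align*}
\mathcal{K}_1 \subseteq \mathcal{K}_2
  &\;\Longrightarrow\;
  \{\, L(f, g^*) : f \in \mathcal{K}_1 \,\}
  \subseteq
  \{\, L(f, g^*) : f \in \mathcal{K}_2 \,\} \\
  &\;\Longrightarrow\;
  \mathcal{E}_2 = \inf_{f \in \mathcal{K}_2} L(f, g^*)
  \;\leq\;
  \inf_{f \in \mathcal{K}_1} L(f, g^*) = \mathcal{E}_1 , \\
\mathcal{K}_{gate} \subseteq \mathcal{K}_{ours}
  &\;\Longrightarrow\;
  \mathcal{E}_{ours} \leq \mathcal{E}_{gate} .
\end{align*}
```

The second step uses the fact that the infimum of a set is a lower bound for every subset of it, so enlarging the hypothesis space can never increase the optimal fitting error.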

Figures (16)

  • Figure 1: (a) Composable LoRA-MoE performs even worse than vanilla LoRA. (b) Poor generalization of different LoRA-MoE architectures as the number of LoRAs grows. (c) Underfitting of various gating methods as the number of LoRAs scales up.
  • Figure 2: Workflow of the proposed RadarGate. Its two key ingredients are RotationGate and StretchGate. RotationGate takes LoRA representations as inputs and proceeds in three steps: 1) LoRA representation categorization, 2) rotation angle generation, and 3) angle injection. The rotated LoRA representation is then stretched in magnitude by StretchGate to produce the output.
  • Figure 3: Performance on Fitting and Ablation. Panel (a) shows fitting capability on same-source training/test sets, while panels (b) and (c) show ablation results for RadarGate's StretchGate and RotationGate components.
  • Figure 4: Scaling performance comparison on three benchmarks. Panels (a), (b), and (c) compare the scaling performance of RadarGate with four baselines as the number of modules increases. Panels (d), (e), and (f) show performance across different model sizes.
  • Figure 5: A GLUE case study visualizes RadarGate. The proposed RadarGate can correctly integrate representations along global norm weight and local angular weight by mid-training, yielding the correct answers, while the Gate in the previous MoLE method fails to generate the correct answer.
  • ...and 11 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof