MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation

Qin Dong, Yuntian Tang, Heming Jia, Yunhang Shen, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Shaohui Lin

TL;DR

MASA targets the representational bottleneck in LoRA by replacing a single down‑projection with a multi‑$A$ ensemble and a single $B$, coupled with asymmetric cross‑layer sharing across adjacent layers. The Multi‑$A$ Expert (MAE) block enriches feature extraction, while sharing $A$ across layer groups preserves efficiency; per‑layer $B$ maintains task‑specific transformations. Theoretical insight shows a single $A$ constrains information to at most $r$ channels, which can be lifted by aggregating multiple $A$’s, and empirical analyses (CKA, t‑SNE) corroborate cross‑layer generalization of $A$ and specialization of $A_i$ experts. Across MMLU, BBH, and domain‑specific tasks on LLaMA backbones, MASA outperforms strong PEFT baselines with only ~0.52% additional trainable parameters, demonstrating robust generalization and multi‑domain adaptability.

Abstract

Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models. It augments a transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA's reliance on a single down-projection matrix $A$ creates a representational bottleneck: this lone feature extractor is insufficient for capturing the diverse signals required by complex tasks. This motivates an architectural shift toward enriching feature adaptation to improve downstream task adaptation. We propose MASA (Multi-$A$ Shared Adaptation), an architecture with a multi-$A$, single-$B$ structure in which the multi-$A$ expert ensemble is asymmetrically shared across layers for parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$ matrix. We validate the method's effectiveness and versatility through experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming standard LoRA by 1.08 points (a relative improvement of 1.84%) with a comparable trainable-parameter budget of 0.52%.
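
The relationship between the reported absolute and relative gains on MMLU can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the numbers quoted in the abstract; the variable names are illustrative, and the LoRA baseline accuracy is inferred from the stated 1.08-point gap rather than quoted from the paper.

```python
# Figures quoted in the abstract.
masa_acc = 59.62      # MASA average accuracy on MMLU, in percent
gain_points = 1.08    # absolute improvement over standard LoRA, in points

# Inferred (not quoted) LoRA baseline accuracy.
lora_acc = masa_acc - gain_points

# Relative improvement of MASA over the inferred baseline, in percent.
relative_gain_pct = 100 * gain_points / lora_acc

print(f"LoRA baseline: {lora_acc:.2f}%")            # 58.54%
print(f"Relative gain: {relative_gain_pct:.2f}%")   # 1.84%
```

The relative figure is measured against the LoRA baseline, not against MASA's own accuracy, which is why it comes out at 1.84%.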

Paper Structure

This paper contains 29 sections, 2 theorems, 8 equations, 5 figures, 9 tables.

Key Result

Theorem 1

For any input $\mathbf{x}\in\mathbb{R}^{d_{\text{in}}}$ with covariance matrix $\Sigma_x\succeq 0$, the down‑projection $\mathbf{u}=A\mathbf{x}$ ($A\in\mathbb{R}^{r\times d_{\text{in}}}$) in a rank‑$r$ LoRA layer satisfies an information ceiling that holds for any deterministic post‑mapping $B$ (even multiple $B$‑heads). Thus, a single‑$A$ LoRA can transmit at most $r$ orthogonal information channels.
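
One way to see why such a ceiling arises, at least for linear post-mappings, is a standard rank argument. This is a sketch under the assumption that $B$ acts as a matrix $B\in\mathbb{R}^{d_{\text{out}}\times r}$ (the symbol $d_{\text{out}}$ is introduced here for illustration only); it is not the paper's proof.

$$
\operatorname{Cov}(BA\mathbf{x}) = BA\,\Sigma_x\,A^{\top}B^{\top},
\qquad
\operatorname{rank}\!\left(BA\,\Sigma_x\,A^{\top}B^{\top}\right) \le \operatorname{rank}(A) \le r .
$$

Stacking several heads $B_1,\dots,B_m$ on the same $A$ does not change this, since $\operatorname{rank}\!\left([B_1;\dots;B_m]\,A\right) \le \operatorname{rank}(A) \le r$. By contrast, stacking $n$ distinct down-projections $A_1,\dots,A_n$ gives a map whose rank is bounded by $\min(nr,\, d_{\text{in}})$ rather than by $r$, which is one way to read the claim that aggregating multiple $A$'s can lift the ceiling.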

Figures (5)

  • Figure 1: Architectures of LoRA variants and our proposed MASA. (a) LoRA: Each fine‑tuned layer is augmented with a single pair of low‑rank adapters, one down‑projection $A$ and one up‑projection $B$. (b) LoRA+MoE: Model capacity is increased by instantiating $k$ independent adapter pairs $(A_i, B_i)$. (c) HydraLoRA: Several up‑projection heads $B_i$ share a common down‑projection $A$, forming a "single‑$A$, multi‑$B$" topology for parameter reuse. (d) Ours: We employ one shared up‑projection $B$ with multiple down‑projections $A_i$, which offers a balanced trade‑off between efficiency and representational capacity.
  • Figure 2: The t-SNE visualization of task-specific features extracted from the V-projection layer of the 11$^{th}$ layer in the LLaMA3-8B model, comparing LoRA and three selected experts of our method after fine-tuning on OpenOrca.
  • Figure 3: Adjacent-layer similarity analysis of LoRA modules using the CKA method. (Top) Heatmaps of CKA similarity scores between adjacent layers for LoRA $A$-matrix outputs (Top left) and full LoRA increments (Top right). (Bottom left) Bar chart of average similarity scores by module type for $A$-matrix outputs versus LoRA increments. (Bottom right) Line plots of layer-wise similarity trends between consecutive layers throughout the network depth.
  • Figure 4: An overview of our proposed MASA architecture. Each transformer layer is augmented with a shared set of $A$ modules and layer-specific $B$ modules. The shared $A$ modules are managed by a Multi-$A$ Expert (MAE) block, and Asymmetric Cross-layer Sharing (ACS) shares them across layers for improved parameter efficiency. Pretrained weights ($W^{(i)}$) are frozen, and only $\{A_{i}^{(k)}\}$ ($k=\lfloor l/S \rfloor$) and $B^{(l)}$ are updated during fine-tuning, where $l$ is the layer index and $S$ is the group size.
  • Figure 5: The t-SNE visualization of task-specific features extracted from the Q-projection layer of the (a) 11$^{th}$ and (b) 14$^{th}$ layer in the LLaMA3-8B model, comparing LoRA and three selected experts of our method after fine-tuning on OpenOrca.
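
The grouping rule $k=\lfloor l/S \rfloor$ from the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, layers are assumed to be indexed from 0, and the group size $S=4$ is an arbitrary choice for the example (32 layers matches LLaMA3-8B).

```python
from collections import defaultdict


def group_index(layer: int, group_size: int) -> int:
    """Index k = floor(l / S) of the shared A-expert set used by layer l."""
    return layer // group_size


def sharing_plan(num_layers: int, group_size: int) -> dict[int, list[int]]:
    """Map each group index k to the layers that reuse its A experts."""
    plan: dict[int, list[int]] = defaultdict(list)
    for layer in range(num_layers):  # assumption: 0-based layer indices
        plan[group_index(layer, group_size)].append(layer)
    return dict(plan)


# Hypothetical setting: 32 layers with an assumed group size S = 4.
plan = sharing_plan(num_layers=32, group_size=4)

num_a_sets = len(plan)   # one shared expert set {A_i^(k)} per group
num_b_matrices = 32      # one layer-specific B^(l) per layer

# Layers that share their A experts with layer 11.
print(plan[group_index(11, 4)])  # [8, 9, 10, 11]
```

Under these assumptions, the 32 layers keep 32 separate $B^{(l)}$ matrices but only 8 shared expert sets, which is the asymmetry the caption describes.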

Theorems & Definitions (4)

  • Theorem 1: Information Ceiling
  • proof: Proof (sketch)
  • Theorem 2: Information Ceiling
  • proof