HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks

Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

TL;DR

HoRA tackles the inefficiency of fully independent per-head adapters in multi-head self-attention by introducing joint hypernetworks that share information across heads. The authors establish a theoretical link between MH-LoRA and Hierarchical Mixture of Experts, show that removing cross-head redundancy via a shared structure improves sample efficiency from exponential to polynomial rates, and demonstrate with extensive vision and language experiments that HoRA outperforms LoRA and other PEFT baselines while adding only a small number of trainable parameters. Practically, HoRA uses hypernetworks to generate both shared and head-specific low-rank adapters, enabling scalable, data-efficient fine-tuning of large transformers. The results indicate strong empirical gains on VTAB-1K, FGVC, and commonsense reasoning benchmarks, highlighting HoRA’s potential for resource-constrained fine-tuning across modalities.

Abstract

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adapts large pre-trained models by adding low-rank matrices to their weight updates. However, in the context of fine-tuning multi-head self-attention (MHA), LoRA has been employed to adapt each attention head separately, thereby overlooking potential synergies across different heads. To mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA) method, which utilizes joint hypernetworks to generate low-rank matrices across attention heads. By coupling their adaptation through a shared generator, HoRA encourages cross-head information sharing, and thus directly addresses the aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens of hierarchical mixture of experts, our theoretical findings reveal that the latter achieves superior sample efficiency to the former. Furthermore, through extensive experiments across diverse language and vision benchmarks, we demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring only a marginal increase in the number of trainable parameters.
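To make the idea of a shared generator concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one small MLP shared by all heads that maps a learned per-head embedding to that head's low-rank factors, so the head-specific part is only the embedding. The function names, the parameter names (`head_emb`, `W1`, `W_A`, `W_B`), the tanh hidden layer, and all sizes are hypothetical choices made for illustration; this summary does not specify HoRA's actual hypernetwork architecture.

```python
import numpy as np


def init_hypernet(rng, n_heads, d_embed, d_hidden, d_in, d_head, rank):
    """Hypothetical joint hypernetwork: shared MLP weights plus per-head embeddings."""
    return {
        # Head-specific part: one small embedding vector per attention head.
        "head_emb": rng.normal(size=(n_heads, d_embed)),
        # Shared part: the same weights generate the adapters of every head.
        "W1": rng.normal(size=(d_embed, d_hidden)) / np.sqrt(d_embed),
        "W_A": rng.normal(size=(d_hidden, rank * d_in)) / np.sqrt(d_hidden),
        # Zero-initialised (as LoRA does for B) so the initial update is zero.
        "W_B": np.zeros((d_hidden, d_head * rank)),
    }


def generate_adapters(params, d_in, d_head, rank):
    """Map each head embedding to low-rank factors A_h (rank x d_in) and B_h (d_head x rank)."""
    hidden = np.tanh(params["head_emb"] @ params["W1"])  # (n_heads, d_hidden)
    A = (hidden @ params["W_A"]).reshape(-1, rank, d_in)
    B = (hidden @ params["W_B"]).reshape(-1, d_head, rank)
    return A, B


def head_updates(A, B):
    """Per-head low-rank update B_h @ A_h, shape (n_heads, d_head, d_in)."""
    return np.einsum("hor,hri->hoi", B, A)


rng = np.random.default_rng(0)
n_heads, d_in, d_head, rank = 4, 32, 8, 2
params = init_hypernet(rng, n_heads, d_embed=16, d_hidden=24,
                       d_in=d_in, d_head=d_head, rank=rank)
W_frozen = rng.normal(size=(n_heads, d_head, d_in))  # stand-in for pre-trained head weights

# At initialisation the generated B factors are zero, so the model is unchanged.
A, B = generate_adapters(params, d_in, d_head, rank)
W_start = W_frozen + head_updates(A, B)

# Stand-in for training: perturb the shared generator, then regenerate all heads.
params["W_B"] = rng.normal(size=params["W_B"].shape) / np.sqrt(params["W_B"].shape[0])
A, B = generate_adapters(params, d_in, d_head, rank)
delta = head_updates(A, B)
W_tuned = W_frozen + delta
```

In this sketch, changing the shared weights `W1`, `W_A`, or `W_B` moves the adapters of every head at once, which is the coupling the abstract describes; plain per-head LoRA would instead store an independent `A` and `B` for each head.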

Paper Structure

This paper contains 29 sections, 5 theorems, 134 equations, 3 figures, 5 tables.

Key Result

Theorem 1

Under the non-shared structure setting in Eq. (eqn:non_share_regression_function), the following minimax lower bound for estimating $G_*$ holds for any $r \geq 1$: where $\mathbb{E}_{g_G}$ stands for the expectation taken with respect to the product measure $g^n_{G}$.

Figures (3)

  • Figure 1: Illustration of HoRA in Multi-head Self-attention.
  • Figure 2: Sample efficiency on the commonsense reasoning datasets.
  • Figure 3: The detail of sample efficiency on each commonsense reasoning dataset with LLaMA-7B settings.

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Proof: Proof of Theorem 1
  • Proposition 1: Model convergence
  • Proof: Return to the proof of Theorem 2
  • Proof: Proof of Proposition 1
  • Lemma 1
  • Lemma 2