HyperAdapt: Simple High-Rank Adaptation

Abel Gurung, Joseph Campbell

TL;DR

HyperAdapt presents a simple, parameter-efficient fine-tuning method that achieves high-rank updates by row- and column-wise diagonal scaling of a pre-trained weight matrix, requiring only $n+m$ trainable parameters per matrix. Theoretical rank bounds show the update can be effectively high-rank, and empirical results across GLUE, arithmetic and commonsense reasoning, and long-context reasoning demonstrate close to full fine-tuning performance with orders of magnitude fewer trainable parameters and no additional inference latency. The approach outperforms or matches strong PEFT baselines like LoRA, DoRA, and VeRA across multiple model sizes, while dramatically reducing memory and compute requirements. This makes high-rank adaptation practical for constrained compute/memory scenarios and scalable to large foundation models, with potential extensions to broader architectures and domains.

Abstract

Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory- and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods such as LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt's updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.
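
The row- and column-wise scaling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `hyperadapt_weight`, the toy dimensions, the all-ones initialization, and the random perturbation standing in for training are all assumptions made for illustration. The sketch only shows that scaling by two diagonal matrices needs $n+m$ trainable values and can be applied by broadcasting, without building the diagonal matrices.

```python
import numpy as np

def hyperadapt_weight(w0, a, b):
    """Return diag(a) @ w0 @ diag(b), computed by broadcasting.

    w0: frozen (n, m) weight matrix; a: (n,) row scales; b: (m,) column scales.
    The function name is hypothetical, chosen for this sketch.
    """
    return a[:, None] * w0 * b[None, :]

rng = np.random.default_rng(0)
n, m = 6, 4
w0 = rng.standard_normal((n, m))  # stands in for a frozen pre-trained weight

# Assumed initialization: all-ones scales, so the adapted weight starts at w0.
a = np.ones(n)  # row scales (trainable)
b = np.ones(m)  # column scales (trainable)
w_init = hyperadapt_weight(w0, a, b)

# Stand-in for training: perturb the scales slightly (not a real optimizer step).
a_trained = a + 0.1 * rng.standard_normal(n)
b_trained = b + 0.1 * rng.standard_normal(m)
w_adapted = hyperadapt_weight(w0, a_trained, b_trained)

trainable = a.size + b.size  # n + m
full = w0.size               # n * m, the count for full fine-tuning
print(f"trainable parameters: {trainable} (full fine-tuning: {full})")
```

Because the adapted weight has the same shape as the original, one plausible way to avoid extra inference latency is to compute `w_adapted` once after training and use it in place of `w0`.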

Paper Structure

This paper contains 19 sections, 1 theorem, 9 equations, 6 figures, 9 tables.

Key Result

Lemma 0

Let $\mathrm{W}_0\in\mathbb{R}^{n\times m}$ and let $\mathrm{A}\in\mathbb{R}^{n\times n}$, $\mathrm{B}\in\mathbb{R}^{m\times m}$ be diagonal matrices. Define $\Delta \mathrm{W} := \mathrm{A}\,\mathrm{W}_0\,\mathrm{B}-\mathrm{W}_0$. Then $\mathrm{rank}(\Delta \mathrm{W}) \le \min\{2 \cdot\mathrm{rank}(\mathrm{W}_0),\, n,\, m\}$.
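
A bound of the form $\mathrm{rank}(\Delta \mathrm{W}) \le \min\{2\cdot\mathrm{rank}(\mathrm{W}_0),\, n,\, m\}$ follows from two standard facts: multiplying by a matrix cannot increase rank, so $\mathrm{rank}(\mathrm{A}\,\mathrm{W}_0\,\mathrm{B}) \le \mathrm{rank}(\mathrm{W}_0)$, and rank is subadditive, so $\mathrm{rank}(\mathrm{A}\,\mathrm{W}_0\,\mathrm{B}-\mathrm{W}_0) \le \mathrm{rank}(\mathrm{A}\,\mathrm{W}_0\,\mathrm{B}) + \mathrm{rank}(\mathrm{W}_0)$; the $n$ and $m$ terms hold for any $n \times m$ matrix. The sketch below checks this numerically on a toy example. The dimensions, the deliberately low-rank $\mathrm{W}_0$, and the random scales are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 12, 10, 3

# Build a W0 of rank r on purpose, so that 2 * rank(W0) is the binding term.
w0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# Random diagonal scales stand in for learned A and B.
a = rng.uniform(0.5, 1.5, size=n)
b = rng.uniform(0.5, 1.5, size=m)

# Delta W = A W0 B - W0, with the diagonal products done by broadcasting.
delta_w = a[:, None] * w0 * b[None, :] - w0

rank_w0 = np.linalg.matrix_rank(w0)
rank_delta = np.linalg.matrix_rank(delta_w)
bound = min(2 * rank_w0, n, m)

print(f"rank(W0) = {rank_w0}, rank(Delta W) = {rank_delta}, bound = {bound}")
```

For a pre-trained weight that is close to full rank, the $\min\{n, m\}$ term is the one that binds, which is why the update is not restricted to a small rank the way a rank-$r$ factorization is.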

Figures (6)

  • Figure 1: Overview of HyperAdapt: (Left) Our proposed method, HyperAdapt, fine-tunes a model by learning row-wise and column-wise diagonal matrices. Unlike full fine-tuning, which requires $n \times m$ trainable parameters, our method yields comparable performance yet only requires $n + m$ trainable parameters. Grayscale values represent frozen parameters, while colored values represent trainable parameters. (Right) Our method achieves similar performance to LoRA across common benchmarks while using significantly fewer trainable parameters.
  • Figure 2: HyperAdapt adjusts a large number of directions via scaling, bootstrapping from pre-trained orthogonal directions (knowledge), achieving a high-rank update. In contrast, low-rank methods modify a limited subset of vectors without any constraint.
  • Figure 3: Normalized update rank across all layers of Qwen-2.5-7B after fine-tuning on Commonsense170K. HyperAdapt produces high-rank updates across most modules, effectively utilizing a large fraction of the available orthogonal directions.
  • Figure 4: Singular-value spectra of the update matrix $\Delta \mathrm W$ as given by HyperAdapt and LoRA for Qwen-2.5-7B and Llama-3-8B. We visualize the first 50 singular values of the update matrix in log scale; values above $1 \times 10^{-2}$ are considered to be non-negligible and contribute to the update's rank. The red dashed line indicates the rank $r$ of LoRA, showing that all values beyond this are negligible. In contrast, HyperAdapt exhibits a slower decay, reflecting a higher-rank update. The top row corresponds to the Query matrix $\mathrm{\Delta W}_Q$ of the 13th layer, and the bottom row corresponds to the Value matrix $\mathrm{\Delta W}_V$ of the 13th layer.
  • Figure 5: Learning rate sensitivity on GSM8K. Best performance is between $3.0\times10^{-4}$ and $3.0\times10^{-3}$.
  • ...and 1 more figure
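
The captions for Figures 3 and 4 describe counting singular values above $1 \times 10^{-2}$ as contributing to the update's rank. A minimal sketch of that measurement follows. The helper names `effective_rank` and `normalized_rank` are hypothetical, normalizing by $\min(n, m)$ is an assumption about what "normalized" means in Figure 3, and the matrices are random toy data rather than trained updates, so the printed numbers say nothing about the paper's results.

```python
import numpy as np

def effective_rank(delta_w, threshold=1e-2):
    """Count the singular values of delta_w that exceed threshold."""
    singular_values = np.linalg.svd(delta_w, compute_uv=False)
    return int(np.sum(singular_values > threshold))

def normalized_rank(delta_w, threshold=1e-2):
    """Effective rank divided by min(n, m); this normalization is an assumption."""
    return effective_rank(delta_w, threshold) / min(delta_w.shape)

rng = np.random.default_rng(0)
n, m, r = 16, 16, 2
w0 = rng.standard_normal((n, m))  # toy stand-in for a pre-trained weight

# LoRA-style update: a product of two thin factors, so its rank is at most r.
lora_delta = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# HyperAdapt-style update: A W0 B - W0 with random (untrained) diagonal scales.
a = rng.uniform(0.5, 1.5, size=n)
b = rng.uniform(0.5, 1.5, size=m)
hyper_delta = a[:, None] * w0 * b[None, :] - w0

print("LoRA-style effective rank:", effective_rank(lora_delta))
print("HyperAdapt-style effective rank:", effective_rank(hyper_delta))
print("HyperAdapt-style normalized rank:", normalized_rank(hyper_delta))
```

The fixed threshold makes the count depend on the scale of the update, so comparisons are only meaningful between updates measured under the same convention.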

Theorems & Definitions (2)

  • Lemma 0
  • proof