MoRe Fine-Tuning with 10x Fewer Parameters

Wenxuan Tan, Nicholas Roberts, Tzu-Heng Huang, Jitian Zhao, John Cooper, Samuel Guo, Chengyu Duan, Frederic Sala

TL;DR

MoRe is theoretically more expressive than LoRA and empirically more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices -- potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.

Paper Structure

This paper contains 16 sections, 4 theorems, 8 equations, 5 figures, and 6 tables.

Key Result

Lemma 1.1

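Let $W$ be an $n\times n$ matrix, where $n=m^2$ for some integer $m$. Let $W_{jk}$ denote the $m\times m$ submatrix of $W$ such that $W = \begin{bmatrix} W_{11} & \cdots & W_{1m} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mm} \end{bmatrix}$. Let $x\in \mathbb{R}^n$, with a similar decomposition into $x_k$ for $k=1,2,\ldots,m$. Then $\|Wx\|_2 \leq \sum_{jk}\|W_{jk}x_k\|_2$.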
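
The inequality in Lemma 1.1 can be checked numerically. The following is a minimal sketch, assuming that $W_{jk}$ are the $m\times m$ blocks of $W$ and $x_k$ the matching length-$m$ segments of $x$; the function name `block_norm_sides` and the random test data are illustrative and not taken from the paper.

```python
import numpy as np


def block_norm_sides(W, x, m):
    """Return (||Wx||_2, sum_{j,k} ||W_jk x_k||_2) for an (m*m) x (m*m) matrix W."""
    n = m * m
    assert W.shape == (n, n) and x.shape == (n,)
    lhs = np.linalg.norm(W @ x)
    rhs = 0.0
    for j in range(m):
        for k in range(m):
            W_jk = W[j * m:(j + 1) * m, k * m:(k + 1) * m]  # assumed m x m block
            x_k = x[k * m:(k + 1) * m]                      # matching segment of x
            rhs += np.linalg.norm(W_jk @ x_k)
    return lhs, rhs


# Illustrative random instance with m = 4, so n = 16.
rng = np.random.default_rng(0)
m = 4
W = rng.standard_normal((m * m, m * m))
x = rng.standard_normal(m * m)
lhs, rhs = block_norm_sides(W, x, m)
print(lhs <= rhs)
```

Under this block layout the bound follows from two applications of the triangle inequality: the $j$-th segment of $Wx$ is $\sum_k W_{jk}x_k$, and the norm of the stacked segments is at most the sum of the segment norms.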

Figures (5)

  • Figure 1: The structure of low-rank Monarch matrices: two permutations $P_1$ and $P_2$, which are fixed, and two block-diagonal components $L$ and $R$, which are learned. In this example, the number of blocks is $N=4$, the input dimension is $n=16$, the block 'rank' is $r_{blk}=2$, and the block size is $n/N=4$. The pseudo-code can be found in the appendix.
  • Figure 2: Matthews correlation on CoLA when trading parameter count for performance along two axes: the block dimension and the number of blocks, both with square blocks. The block dimensions used are $[4, 8, 16, 32, 64]$ and the values of $N$ are $[1024, 256, 128, 32, 16]$.
  • Figure 3: Fixing $r_{blk} = 4$, increasing the number of blocks beyond $4$ does not lead to better performance.
  • Figure 4: Llama 7B trained on math reasoning tasks
  • Figure 5: RoBERTa-large trained on CoLA
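
The structure described in the Figure 1 caption can be sketched in a few lines of NumPy, using the caption's sizes ($N=4$, $n=16$, $r_{blk}=2$). This is a sketch under stated assumptions and not the paper's implementation: the function name `monarch_apply`, the block storage layout, and the choice of transpose-and-reshape for both permutations are assumptions, and the permutations used in the paper may differ.

```python
import numpy as np


def monarch_apply(x, R_blocks, L_blocks):
    """Apply a low-rank Monarch-style map to a vector x.

    R_blocks: (N, r_blk, n/N), the N blocks of block-diagonal R (assumed layout).
    L_blocks: (N, n_out/N, r_blk), the N blocks of block-diagonal L (assumed layout).
    """
    N, r_blk, blk_in = R_blocks.shape
    N2, blk_out, r2 = L_blocks.shape
    assert N == N2 and r_blk == r2
    assert x.shape == (N * blk_in,)
    # Block-diagonal R: each block maps one length-n/N segment to r_blk values.
    h = np.einsum("nrk,nk->nr", R_blocks, x.reshape(N, blk_in))
    # Fixed permutation (assumed): transpose, then regroup into N segments.
    h = h.T.reshape(N, r_blk)
    # Block-diagonal L: each block maps r_blk values to n_out/N outputs.
    y = np.einsum("nor,nr->no", L_blocks, h)
    # Fixed output permutation (assumed): transpose, then flatten.
    return y.T.reshape(-1)


# Sizes from the Figure 1 caption.
N, n, r_blk = 4, 16, 2
rng = np.random.default_rng(0)
R = rng.standard_normal((N, r_blk, n // N))
L = rng.standard_normal((N, n // N, r_blk))

# Materialize the dense matrix by applying the map to each basis vector.
M = np.stack([monarch_apply(e, R, L) for e in np.eye(n)], axis=1)

n_params = R.size + L.size
print("trainable parameters:", n_params)
print("rank of dense matrix:", np.linalg.matrix_rank(M))
```

Under this layout the two factors hold $2 n r_{blk}$ trainable values, the same count as a rank-$r_{blk}$ LoRA update on an $n\times n$ weight, while the rank of the dense matrix is bounded by $N r_{blk}$ and not by $r_{blk}$.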

Theorems & Definitions (7)

  • Lemma 1.1
  • proof
  • Corollary 1.2
  • proof
  • Theorem 1.3
  • proof
  • Theorem 1.4