MoRe Fine-Tuning with 10x Fewer Parameters

Wenxuan Tan; Nicholas Roberts; Tzu-Heng Huang; Jitian Zhao; John Cooper; Samuel Guo; Chengyu Duan; Frederic Sala

TL;DR

MoRe is provably more expressive than LoRA and, empirically, more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices -- potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.

Paper Structure (16 sections, 4 theorems, 8 equations, 5 figures, 6 tables)

Introduction
Related Work
MoRe Framework
Architectural Choices & Analysis
Experimental Results
Conclusion
Theoretical results
Optimizations for Rectangular Monarch matrices
Hyperparameter Tuning
GLUE Language Understanding
Math reasoning and Commonsense reasoning
Architecture Ablations
Learned Weight Distributions
Failure Cases
Limitations and Future Work
...and 1 more section

Key Result

Lemma 1.1

Let $W$ be an $n\times n$ matrix, where $n=m^2$ for some integer $m$. Let $W_{jk}$ denote the $(j,k)$-th $m\times m$ block of $W$ for $j,k=1,2,\ldots,m$, and let $x\in \mathbb{R}^n$ have a similar decomposition into blocks $x_k\in\mathbb{R}^m$ for $k=1,2,\ldots,m$. Then $\|Wx\|_2 \leq \sum_{jk}\|W_{jk}x_k\|_2$.
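
The bound in Lemma 1.1 can be sanity-checked numerically. The sketch below assumes that $W_{jk}$ is the $(j,k)$-th $m\times m$ block of $W$ and that $x_k$ is the $k$-th length-$m$ chunk of $x$; under that reading the inequality follows from applying the triangle inequality twice. The helper names (`l2`, `matvec`, `block`, `lemma_sides`) are illustrative and do not come from the paper.

```python
import math
import random


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))


def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * t for a, t in zip(row, v)) for row in A]


def block(W, j, k, m):
    """Assumed reading of W_{jk}: rows j*m..(j+1)*m and columns k*m..(k+1)*m (0-indexed)."""
    return [row[k * m:(k + 1) * m] for row in W[j * m:(j + 1) * m]]


def lemma_sides(W, x, m):
    """Return (||Wx||_2, sum over j,k of ||W_{jk} x_k||_2)."""
    lhs = l2(matvec(W, x))
    rhs = 0.0
    for j in range(m):
        for k in range(m):
            x_k = x[k * m:(k + 1) * m]
            rhs += l2(matvec(block(W, j, k, m), x_k))
    return lhs, rhs


rng = random.Random(0)
m = 3
n = m * m
W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
lhs, rhs = lemma_sides(W, x, m)
print(lhs <= rhs + 1e-12)  # small tolerance for floating-point rounding
```

A passing check on random inputs is only a sanity test, not a proof; the paper's own proof is listed under Theorems & Definitions.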

Figures (5)

Figure 1: The structure of low-rank Monarch matrices: two fixed permutations $P_1$ and $P_2$ and two learned block-diagonal components $L$ and $R$. In this example, the number of blocks is $N=4$, the input dimension is $n=16$, the block 'rank' is $r_{blk}=2$, and the block size is $n/N=4$. The pseudo-code can be found in appendix \ref{algo:monarch}.
Figure 2: Matthews correlation on CoLA when trading parameter count for performance along two axes: the block dimension and the number of blocks, both with square blocks. The block dimensions used are $[4, 8, 16, 32, 64]$ and the $N$ are $[1024, 256, 128, 32, 16]$.
Figure 3: Fixing $r_{blk} = 4$, increasing the number of blocks beyond $4$ does not lead to better performance.
Figure 4: Llama 7b trained on Math Reasoning tasks
Figure 5: RoBERTa-large trained on CoLA
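
To make the Figure 1 caption concrete, here is a minimal sketch of applying a low-rank Monarch-style map with the caption's sizes ($N=4$, $n=16$, $r_{blk}=2$, block size $n/N=4$). Several choices here are assumptions and are not taken from the paper's pseudo-code: the composition order $y = P_2 L P_1 R x$, the form of both permutations (a reshape-and-transpose "stride" permutation), and all function names. Only the shapes of the block-diagonal factors follow the caption.

```python
import random

n, N, r_blk = 16, 4, 2      # values from the Figure 1 caption
blk = n // N                # block size n/N = 4

rng = random.Random(0)
# R: N blocks of shape (r_blk x blk); L: N blocks of shape (blk x r_blk).
R = [[[rng.gauss(0, 1) for _ in range(blk)] for _ in range(r_blk)] for _ in range(N)]
L = [[[rng.gauss(0, 1) for _ in range(r_blk)] for _ in range(blk)] for _ in range(N)]


def block_diag_apply(blocks, v):
    """Multiply a block-diagonal matrix (given as a list of blocks) by v."""
    width = len(blocks[0][0])
    out = []
    for i, B in enumerate(blocks):
        chunk = v[i * width:(i + 1) * width]
        out.extend(sum(a * t for a, t in zip(row, chunk)) for row in B)
    return out


def stride_permute(v, rows, cols):
    """Assumed fixed permutation: view v as a (rows x cols) grid, transpose, flatten."""
    return [v[r * cols + c] for c in range(cols) for r in range(rows)]


def monarch_apply(x):
    h = block_diag_apply(R, x)          # length N * r_blk = 8
    h = stride_permute(h, N, r_blk)     # P1 (assumed form)
    y = block_diag_apply(L, h)          # length N * blk = 16
    return stride_permute(y, N, blk)    # P2 (assumed form)


x = [rng.gauss(0, 1) for _ in range(n)]
y = monarch_apply(x)
num_params = sum(len(B) * len(B[0]) for B in R + L)
print(len(y), num_params, n * n)  # prints: 16 64 256
```

With these sizes the two learned factors hold 64 parameters in total, compared with 256 for a dense $16\times 16$ update; the permutations are fixed and contribute none.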

Theorems & Definitions (7)

Lemma 1.1
proof
Corollary 1.2
proof
Theorem 1.3
proof
Theorem 1.4
