MultiLoRA: Democratizing LoRA for Better Multi-Task Learning

Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang

TL;DR

This work tackles the challenge of efficiently adapting large language models to multi-task settings by addressing LoRA's top-singular-vector dominance. It introduces MultiLoRA, which horizontally scales LoRA modules, uses learnable per-module scaling, and adopts Kaiming-based initialization to broaden the residual transform subspace. Using a carefully constructed multi-task dataset and extensive evaluation on LLaMA models (7B–65B), the authors show that MultiLoRA consistently outperforms LoRA and matches or exceeds full fine-tuning with far fewer added parameters, especially on smaller models. SVD-based analyses of the weight updates show that MultiLoRA yields more democratic unitary transform contributions, which brings its behavior closer to full fine-tuning and helps explain its empirical gains on complex, multi-task adaptation.

Abstract

LoRA achieves remarkable resource efficiency and comparable performance when adapting LLMs to specific tasks. Since ChatGPT demonstrated superior performance on various tasks, there has been a growing desire to adapt one model for all tasks. However, the explicit low rank of LoRA limits adaptation performance in complex multi-task scenarios. LoRA is dominated by a small number of top singular vectors, while fine-tuning decomposes into a set of less important unitary transforms. In this paper, we propose MultiLoRA for better multi-task adaptation by reducing the dominance of top singular vectors observed in LoRA. MultiLoRA scales LoRA modules horizontally and changes the parameter initialization of the adaptation matrices to reduce parameter dependency, thus yielding more balanced unitary subspaces. We construct, for the first time, specialized training data by mixing datasets for instruction following, natural language understanding, and world knowledge, to cover semantically and syntactically different samples. With only 2.5% additional parameters, MultiLoRA outperforms single LoRA counterparts and fine-tuning on multiple benchmarks and model scales. Further investigation into the weight update matrices of MultiLoRA shows reduced dependency on top singular vectors and more democratic unitary transform contributions.
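
The mechanism described in the abstract (several parallel LoRA modules whose contributions are summed, each with its own learnable scaling factor) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MultiLoRASketch`, the helper `kaiming_uniform`, the specific bound sqrt(6 / fan_in), and the exact placement of the scaling factor are assumptions made for illustration, based only on the summary's mention of Kaiming-based initialization and zero-initialized scaling.

```python
import numpy as np

def kaiming_uniform(rng, shape):
    """Kaiming-style uniform init (assumed form): U(-b, b), b = sqrt(6 / fan_in)."""
    fan_in = shape[1]
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

class MultiLoRASketch:
    """Hypothetical sketch: n parallel rank-r modules added to a frozen weight."""

    def __init__(self, weight, n=2, r=3, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # Both factors of every module are randomly initialized (assumption).
        self.A = [kaiming_uniform(rng, (r, d_in)) for _ in range(n)]
        self.B = [kaiming_uniform(rng, (d_out, r)) for _ in range(n)]
        # One scaling factor per module, zero-initialized so the adapted
        # layer starts out identical to the frozen layer.
        self.scale = np.zeros(n)

    def delta_w(self):
        """Summed weight update: sum_i scale_i * B_i @ A_i."""
        return sum(s * (B @ A) for s, A, B in zip(self.scale, self.A, self.B))

    def forward(self, x):
        """Apply the adapted layer to a batch x of shape (batch, d_in)."""
        return x @ (self.weight + self.delta_w()).T

# Usage with synthetic data.
rng = np.random.default_rng(1)
W = rng.normal(size=(16, 12))
x = rng.normal(size=(4, 12))
layer = MultiLoRASketch(W, n=2, r=3)
y_initial = layer.forward(x)   # equals the frozen layer's output
layer.scale[:] = 1.0           # pretend training moved the scales
update_rank = np.linalg.matrix_rank(layer.delta_w())
```

Because the scaling factors start at zero, the first forward pass reproduces the frozen layer exactly. Once the factors are non-zero, the summed update can have rank up to n × r, even though each module is only rank r.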

Paper Structure

This paper contains 28 sections, 5 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Top singular value distribution of the weight update matrix $\Delta W_{v\_proj}$. (a) Complete view of the histogram. (b) Close-up view of the top singular values. Both histograms are plotted on the negative logarithms of the singular values, $-\log(s)$, so the left end of the horizontal axis represents larger singular values. The bell-shaped curve of full-parameter fine-tuning indicates a democratic composition of a large number of relatively less important unitary transforms. On the other hand, LoRA relies heavily on a small group of important unitary transforms, which could hurt complex multi-task adaptation.
  • Figure 2: Overview of MultiLoRA. Multiple parallel LoRA modules are used to adapt the target weight matrix. A changed parameter initialization and a zero-initialized scaling factor are introduced to democratize residual weight updates.
  • Figure 3: (a) Throughput and (b) peak VRAM usage benchmarked when training LLaMA-7B with sequences of 1024 tokens and a batch size of 1. $n\times r$ on the horizontal axis indicates the total rank of LoRA and MultiLoRA. Thanks to the high parallelism of MultiLoRA, training throughput is almost identical to that of LoRA. VRAM usage scales linearly with the number of parallel LoRA modules.
  • Figure 4: Subspace similarity to fine-tuning of LoRA (1), MultiLoRA (2, 3) and fine-tuning with a different random seed (4). LoRA (2) and MultiLoRA (3) share the same parameter budget, but MultiLoRA exhibits stronger subspace similarity to fine-tuning. The heatmap of MultiLoRA$_{r=32}^{n=3}$ does not differ much from that of MultiLoRA$_{r=32}^{n=5}$. Only $i,j\in[1,30]$ are presented for better visibility.
  • Figure 5: Singular value distribution of the weight update matrices $\Delta W$ of k_proj (left) and v_proj (right). Our proposed MultiLoRA exhibits a higher degree of resemblance to fine-tuning. Scaling up $n$ produces more democratic unitary transform contributions.
  • ...and 3 more figures
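
The singular-value analysis behind Figures 1 and 5 can be sketched as follows. This is a toy illustration on synthetic matrices, not a reproduction of the paper's measurements: the Gaussian `dense_update` merely stands in for a full fine-tuning update, `low_rank_update` stands in for a single LoRA update, and the helper names `neg_log_singular_values` and `top_k_share` are hypothetical.

```python
import numpy as np

def neg_log_singular_values(delta_w, eps=1e-12):
    """Return -log(s) for the singular values s of delta_w.

    Smaller values correspond to larger singular values, matching the
    horizontal-axis convention described for Figure 1.
    """
    s = np.linalg.svd(delta_w, compute_uv=False)
    return -np.log(s + eps)  # eps guards against log(0) for rank-deficient updates

def top_k_share(delta_w, k):
    """Fraction of the total singular-value mass held by the top k values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # sorted in descending order
    return s[:k].sum() / s.sum()

# Synthetic stand-ins (assumptions, not the paper's data).
rng = np.random.default_rng(0)
d, r = 64, 4
dense_update = rng.normal(size=(d, d))
low_rank_update = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

dense_share = top_k_share(dense_update, r)        # mass spread over many directions
low_rank_share = top_k_share(low_rank_update, r)  # essentially all mass in r directions
dense_neg_log = neg_log_singular_values(dense_update)

# A histogram of -log(s), as in Figure 1, could then be built with
# np.histogram(dense_neg_log, bins=20).
```

In this toy setting the rank-r update concentrates virtually all of its singular-value mass in its top r directions, while the dense update spreads it across many. That is the kind of contrast the captions describe between LoRA and full fine-tuning.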