CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

Xiaojun Xiao, Sen Shen, Qiming Bao, Hongfei Rong, Kairui Liu, Zhongsheng Wang, Jiamou Liu

TL;DR

This work addresses the high computational cost of fine-tuning large language models by compressing LoRA through a shared common subspace. CoRA extracts a common B-like subspace from multiple downstream models using SVD on aggregated QKV matrices, substituting B with a reduced basis and offering two schemes: freeze the common subspace or use it as a better initialization for B. Empirical results on Llama2-13B-hf across yahma and a text-to-code dataset show that freezing achieves similar efficacy with roughly half the trainable parameters, while the initialization variant can yield up to a 4% improvement in generation quality under the same budget. The study combines objective metrics (ROUGE, METEOR, SacreBLEU, BERT Score) with GPT-4 and human evaluations to demonstrate practical gains in efficiency and performance, pointing to broader applicability of shared subspaces for resource-constrained fine-tuning.

Abstract

When fine-tuning large language models (LLMs), it is crucial both to conserve computational resources without losing effectiveness and to improve outcomes within a fixed computational budget. Low-Rank Adaptation (LoRA) balances efficiency and performance by reducing the number of trainable parameters and the computational cost of fine-tuning. However, recent work on LoRA has focused mainly on its fine-tuning methodology, with comparatively little exploration of further compressing LoRA itself. Because most of LoRA's parameters may still be superfluous, this can waste computational resources. In this paper, we propose CoRA, which leverages shared knowledge to optimize LoRA training by substituting its matrix $B$ with a common subspace extracted from large models. Our two-fold method (1) freezes the substitute matrix $B$, halving the trainable parameters, while training matrix $A$ for the specific task, and (2) uses the substitute matrix $B$ as an enhanced initial state for the original matrix $B$, achieving improved results with the same number of parameters. Our experiments show that the first approach achieves the same efficacy as original LoRA fine-tuning while being more efficient, with roughly half the trainable parameters. The second approach improves on the performance of original LoRA fine-tuning. Together, these results attest to the effectiveness of our work.
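The parameter saving of the first scheme can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the standard LoRA update $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, and the function name `lora_trainable_params` and the sizes `d = 1024`, `r = 8` are hypothetical choices, not values from the paper.

```python
def lora_trainable_params(d_out, d_in, rank, freeze_b):
    """Count trainable parameters of one LoRA pair, delta_W = B @ A.

    Assumed shapes (standard LoRA convention):
      B: (d_out, rank), A: (rank, d_in)
    """
    a_params = rank * d_in   # entries of A
    b_params = d_out * rank  # entries of B
    # When B is frozen (scheme 1), only A is trained.
    return a_params if freeze_b else a_params + b_params


# Illustrative sizes only; not taken from the paper.
d, r = 1024, 8
full = lora_trainable_params(d, d, r, freeze_b=False)
frozen = lora_trainable_params(d, d, r, freeze_b=True)
print(full, frozen)  # 16384 8192
```

For a square projection ($d_{out} = d_{in}$), freezing $B$ leaves exactly half of the LoRA parameters trainable. The second scheme keeps both matrices trainable, so its parameter count equals that of plain LoRA.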

Paper Structure (12 sections, 9 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: A simple case illustrating the problem this paper addresses and what CoRA does. A new method initializes the $B$ matrix, replacing the default initialization in LoRA, and the fine-tuning task then proceeds as usual.
  • Figure 2: An overview of CoRA. From the downstream large models, we extract a common basis matrix from the $Q$, $K$, and $V$ matrices of the corresponding attention heads. Using Singular Value Decomposition (SVD) for dimensionality reduction, we adapt this matrix to the projection shape required by the LoRA fine-tuning paradigm. The adapted common basis matrix replaces the original $B$ matrices in LoRA on the $Q$, $K$, and $V$ matrices of the corresponding attention heads, and this modification is carried into the subsequent training process. Fine-tuning is conducted in two ways: one where the $B$ matrix is replaced with the common basis matrix and then frozen, and another where the $B$ matrix is replaced but continues to be trained.
  • Figure 3: Details of how CoRA extracts the common matrix space and applies SVD dimensionality reduction to match the form of the $B$ matrix, so that it can be used in lightweight fine-tuning of large models.
  • Figure 4: The number of principal components or singular values required to fully explain the variance using PCA (left) and SVD (right). PCA requires about 3000 principal components, while SVD requires only about 130 singular values, showing the efficiency of SVD for dimensionality reduction of high-dimensional data.
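The extraction step described in Figures 2 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the $Q$, $K$, and $V$ weights are aggregated by horizontal concatenation, that the common basis is the top-$r$ left singular vectors, and it uses random matrices as stand-ins for real model weights. The names `common_basis` and `components_for_variance` are hypothetical.

```python
import numpy as np


def common_basis(weights, rank):
    """Return a (d, rank) orthonormal basis shared by the given matrices.

    Assumption: aggregation is a horizontal concatenation of (d, d)
    matrices, and the basis is the top-`rank` left singular vectors.
    """
    stacked = np.concatenate(weights, axis=1)  # shape (d, n * d)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :rank]  # shape (d, rank), candidate substitute for B


def components_for_variance(singular_values, threshold=0.99):
    """Smallest number of singular values whose squared sum reaches
    `threshold` of the total squared sum (explained variance)."""
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(singular_values))


# Random stand-ins for the Q, K, V projection weights (illustrative only).
rng = np.random.default_rng(0)
d, r = 64, 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

B_common = common_basis([W_q, W_k, W_v], rank=r)
print(B_common.shape)  # (64, 8)
```

In the frozen scheme, a matrix like `B_common` would be held fixed while only $A$ is trained; in the initialization scheme it would serve as the starting value of a trainable $B$. A helper such as `components_for_variance` gives one way to pick the rank from the singular value spectrum, in the spirit of the comparison in Figure 4.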