Table of Contents
Fetching ...

Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates

Cristian Meo, Ksenia Sycheva, Anirudh Goyal, Justin Dauwels

TL;DR

Bayesian-LoRA is proposed which approaches low-rank adaptation and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values, and is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix.

Abstract

It is a common practice in natural language processing to pre-train a single model on a general domain and then fine-tune it for downstream tasks. However, when it comes to Large Language Models, fine-tuning the entire model can be computationally expensive, resulting in very intensive energy consumption. As a result, several Parameter Efficient Fine-Tuning (PEFT) approaches were recently proposed. One of the most popular approaches is low-rank adaptation (LoRA), where the key insight is decomposing the update weights of the pre-trained model into two low-rank matrices. However, the proposed approaches either use the same rank value across all different weight matrices, which has been shown to be a sub-optimal choice, or do not use any quantization technique, one of the most important factors when it comes to a model's energy consumption. In this work, we propose Bayesian-LoRA which approaches low-rank adaptation and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values. As a result, B-LoRA is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix. We validate the proposed model by fine-tuning a pre-trained DeBERTaV3 on the GLUE benchmark. Moreover, we compare it to relevant baselines and present both qualitative and quantitative results, showing how the proposed approach is able to learn optimal-rank quantized matrices. B-LoRA performs on par with or better than the baselines while reducing the total number of bit operations by roughly 70% compared to the baseline methods.

Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates

TL;DR

Bayesian-LoRA is proposed which approaches low-rank adaptation and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values, and is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix.

Abstract

It is a common practice in natural language processing to pre-train a single model on a general domain and then fine-tune it for downstream tasks. However, when it comes to Large Language Models, fine-tuning the entire model can be computationally expensive, resulting in very intensive energy consumption. As a result, several Parameter Efficient Fine-Tuning (PEFT) approaches were recently proposed. One of the most popular approaches is low-rank adaptation (LoRA), where the key insight is decomposing the update weights of the pre-trained model into two low-rank matrices. However, the proposed approaches either use the same rank value across all different weight matrices, which has been shown to be a sub-optimal choice, or do not use any quantization technique, one of the most important factors when it comes to a model's energy consumption. In this work, we propose Bayesian-LoRA which approaches low-rank adaptation and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values. As a result, B-LoRA is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix. We validate the proposed model by fine-tuning a pre-trained DeBERTaV3 on the GLUE benchmark. Moreover, we compare it to relevant baselines and present both qualitative and quantitative results, showing how the proposed approach is able to learn optimal-rank quantized matrices. B-LoRA performs on par with or better than the baselines while reducing the total number of bit operations by roughly 70% compared to the baseline methods.
Paper Structure (29 sections, 34 equations, 4 figures, 6 tables, 2 algorithms)

This paper contains 29 sections, 34 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: (Left) B-LoRA Scheme: As mentioned in Sec. \ref{['sec:intro']}, every weight $W$ we can be decomposed as $W = W_0 + BEA$. (Right) Rank Adaptation and Quantization techniques are visually represented following equation \ref{['eq:posterior_r']} for Rank Adaption, and equations \ref{['eq:priors_z_active']} and \ref{['eq:priors_z_inactive']} for Quantization, respectively. Visual Representation of quantization technique is taken from Mart2020Bayesian.
  • Figure 2: Rank distribution for GLUE benchmark. Last layers have larger rank values compared to the first layers. Ranks of values $W_v$ are larger than ranks of keys $W_k$ and queries $W_q$.
  • Figure 3: Quantization levels for GLUE benchmark. For each type of weight/activation, we compute median value of its bitwidth across the encoder. LoRA modules are kept in lower precision of $2, 4$ bits. Values $W_v$ are kept in higher precision than keys $W_k$ and queries $W_q$.
  • Figure 4: Rank Distribution for AdaLoRA on MNLI dataset.