BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Yibin Wang; Haizhou Shi; Ligong Han; Dimitris Metaxas; Hao Wang

TL;DR

This work tackles the overconfidence of large language models after domain-specific fine-tuning by proposing BLoB, a Bayesian low-rank adaptation method that jointly learns the mean and covariance of a LoRA-based parameter update through backpropagation. By asymmetrically Bayesianizing only the low-rank matrix A while keeping B deterministic, and by placing a low-rank prior on the full weights, BLoB achieves efficient variational inference with a closed-form KL term, and it improves sample efficiency with Flipout. Empirically, BLoB delivers better uncertainty calibration (lower NLL and ECE) and strong generalization on in-distribution data, with competitive or superior performance under distribution shift across multiple tasks and architectures, at modest memory and compute overhead. Overall, BLoB demonstrates that jointly optimizing the mean and covariance of a low-rank posterior during fine-tuning can improve the reliability and robustness of LLMs in practical deployment.
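As a rough illustration of the ingredients named in the TL;DR, the sketch below samples a Gaussian low-rank matrix A by reparameterization, applies the LoRA-style update with a deterministic B, and evaluates the standard closed-form KL divergence between a diagonal Gaussian and a zero-mean isotropic Gaussian prior. It is a minimal pure-Python sketch and not the authors' implementation: the function names, the nested-list matrix layout, and the choice of a zero-mean isotropic prior on A are assumptions made for illustration, and Flipout is not shown.

```python
import math
import random


def kl_diag_gaussian(mean, std, prior_std):
    """KL( N(mean, diag(std^2)) || N(0, prior_std^2 I) ), summed over all entries.

    Generic closed form for diagonal Gaussians (assumed prior: zero mean, isotropic).
    """
    total = 0.0
    for mu_row, sd_row in zip(mean, std):
        for mu, sd in zip(mu_row, sd_row):
            total += (math.log(prior_std / sd)
                      + (sd * sd + mu * mu) / (2.0 * prior_std ** 2)
                      - 0.5)
    return total


def sample_A(mean, std, rng):
    """Reparameterized sample: A = mean + std * eps, with eps ~ N(0, 1) per entry."""
    return [[mu + sd * rng.gauss(0.0, 1.0) for mu, sd in zip(mu_row, sd_row)]
            for mu_row, sd_row in zip(mean, std)]


def matvec(mat, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def lora_forward(W0, B, A, x):
    """y = W0 x + B (A x): frozen W0, deterministic B, sampled A."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


# Hypothetical toy shapes: m = n = 2, rank r = 1.
rng = random.Random(0)
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [-0.5]]
A_mean = [[0.1, -0.2]]
A_std = [[0.05, 0.05]]

A = sample_A(A_mean, A_std, rng)
y = lora_forward(W0, B, A, [2.0, 3.0])
kl = kl_diag_gaussian(A_mean, A_std, prior_std=0.2)
```

In a training loop, a term like `kl` would be added to the data negative log-likelihood to form the variational objective, and averaging `lora_forward` outputs over several samples of A gives a Monte Carlo prediction.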

Abstract

Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, the performance of such post-training approaches is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
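Uncertainty estimation here is reported through NLL and ECE, as noted in the TL;DR. The sketch below shows one common way to compute both from per-example predictions. The equal-width binning and the default of 15 bins are assumptions of this example, not settings confirmed by this page, and the function names are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width confidence bins: sum_b (|b| / n) * |acc(b) - conf(b)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1.0 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


def negative_log_likelihood(true_label_probs):
    """Mean negative log-probability assigned to the true labels."""
    return -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)


# An overconfident toy model: 90% confidence, 50% accuracy.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
nll = negative_log_likelihood([0.9, 0.9, 0.1, 0.1])
```

Lower values of both metrics indicate that predicted confidence tracks observed accuracy more closely.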

Paper Structure (32 sections, 2 theorems, 47 equations, 5 figures, 11 tables, 1 algorithm)

Introduction
Preliminaries
Low-Rank Adaptation (LoRA)
Variational Bayesian Networks (VBNs)
Methodology
Low-Rank Variational Approximate Posterior Distribution: LoRA Bayesianization
Low-Rank Prior Distribution
Parameterization of the Low-Rank Variational Distribution
On Improving the Sample Efficiency of BLoB
BLoB: Final Algorithm
Experiments
Settings
Results on In-distribution Datasets
Results on Out-of-Distribution Datasets
Related Work
...and 17 more sections

Key Result

Theorem 3.1

With the pre-trained weight matrix ${\bm{W}}_0 \in\mathbb{R}^{m\times n}$ and the low-rank weight update matrix ${\bm{B}} \in \mathbb{R}^{m\times r}$, suppose that the variational distribution of the other low-rank update matrix ${\bm{A}} \in \mathbb{R}^{r\times n}$ is Gaussian with $q({\bm{A}}|{\bm

Figures (5)

Figure 1: Overview of our Bayesian Low-Rank Adaptation by Backpropagation, i.e., BLoB (right) as well as comparison with existing methods such as LoRA (left) and Laplace LoRA (middle).
Figure 2: The growth curve of $\sigma_q = \log(1+e^{\rho})$ and $\sigma_q = \rho^2$ during the optimization of KL divergence (without data likelihood). The number of gradient steps (5000) is marked with the red line.
Figure 3: Performance of BLoB with Varying Sample Sizes $N$ during Inference. We fine-tune the Llama2-7B model on the WG-S dataset for 5,000 steps, evaluating the model's performance with different sample sizes, specifically when $N$ is 1, 2, 3, 4, 5, 10, 20, 40, 80, and 160.
Figure 4: Performance of BLoB (N=10) with Varying Prior Gaussian Standard Deviations $\sigma_p$. We fine-tune the Llama2-7B model on the WG-S dataset for 5,000 gradient steps, evaluating the model's performance with different prior Gaussian standard deviations and learning rates of KL divergence.
Figure 5: Visualization of embedding uncertainty quality for different methods. We fine-tune the Llama2-7B model on the OBQA dataset for 5,000 steps. The two contour lines represent the probability mass of 0.5 and 0.75, respectively.
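The two parameterizations compared in the Figure 2 caption can be explored with a small simulation. The sketch below runs plain gradient descent on the KL term alone for a single scalar standard deviation, once with $\sigma_q = \log(1+e^{\rho})$ and once with $\sigma_q = \rho^2$. The zero-mean scalar setting, the prior standard deviation, the initial value, the learning rate, and the step count are all assumptions chosen for illustration and do not reproduce the paper's experiment.

```python
import math


def kl_grad_sigma(sigma, prior_std):
    """d/d(sigma) of KL( N(0, sigma^2) || N(0, prior_std^2) )."""
    return -1.0 / sigma + sigma / prior_std ** 2


def run(param, sigma0, prior_std, lr, steps):
    """Gradient descent on the KL term only; returns the trace of sigma values."""
    if param == "softplus":
        rho = math.log(math.expm1(sigma0))   # inverse softplus of sigma0
    else:
        rho = math.sqrt(sigma0)              # inverse of sigma = rho^2
    trace = []
    for _ in range(steps):
        if param == "softplus":
            sigma = math.log1p(math.exp(rho))
            dsigma_drho = 1.0 / (1.0 + math.exp(-rho))   # sigmoid(rho)
        else:
            sigma = rho * rho
            dsigma_drho = 2.0 * rho
        trace.append(sigma)
        # Chain rule: dKL/drho = dKL/dsigma * dsigma/drho.
        rho -= lr * kl_grad_sigma(sigma, prior_std) * dsigma_drho
    return trace


# Hypothetical settings: small initial sigma, prior std 0.2.
softplus_trace = run("softplus", sigma0=0.01, prior_std=0.2, lr=1e-3, steps=200)
square_trace = run("square", sigma0=0.01, prior_std=0.2, lr=1e-3, steps=200)
```

Plotting `softplus_trace` and `square_trace` against the step index gives growth curves of the kind the caption describes. With a small initial $\sigma_q$, the softplus derivative is close to $\sigma_q$ itself, so its updates are small early on.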

Theorems & Definitions (5)

Theorem 3.1: Variational Distribution of the Full-Weight Matrix in BLoB
Remark
Theorem 3.2: Efficient Computation of Full-Weight KL Divergence
proof
proof
