Bayesian Low-rank Adaptation for Large Language Models

Adam X. Yang, Maxime Robeyns, Xi Wang, Laurence Aitchison

TL;DR

Fine-tuned LLMs often exhibit overconfidence, particularly with limited data. The paper introduces Laplace-LoRA, a post-hoc Bayesian method that applies a Laplace approximation to the LoRA parameter posterior, enabling uncertainty estimation without modifying existing fine-tuning pipelines. Across six common-sense tasks and distribution-shift scenarios, Laplace-LoRA substantially improves calibration metrics (ECE, NLL) with only modest memory and runtime overhead. This approach demonstrates that scalable, uncertainty-aware fine-tuning is feasible for parameter-efficient adapters in large language models.
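
The two calibration metrics named above can be made concrete with a short sketch. This is an illustrative, dependency-free implementation and not the authors' evaluation code: the function names, the equal-width binning, and the default of 15 bins are assumptions made for the example.

```python
import math


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins (binning scheme is an assumption).

    probs:  list of per-class probability lists, one per example.
    labels: list of integer class labels.
    """
    n = len(labels)
    sum_conf = [0.0] * n_bins     # summed confidence per bin
    sum_correct = [0.0] * n_bins  # number of correct predictions per bin
    for p, y in zip(probs, labels):
        conf = max(p)             # confidence = probability of predicted class
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        sum_conf[b] += conf
        sum_correct[b] += 1.0 if pred == y else 0.0
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to the expression below.
    return sum(abs(c - s) for c, s in zip(sum_correct, sum_conf)) / n


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))  # clamp to avoid log(0)
    return total / len(labels)


if __name__ == "__main__":
    # Toy predictions: two confident (0.9) and two less confident (0.6) examples.
    toy_probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
    toy_labels = [0, 1, 0, 0]
    print(expected_calibration_error(toy_probs, toy_labels, n_bins=10))
    print(negative_log_likelihood(toy_probs, toy_labels))
```

An overconfident model assigns high confidence to predictions that are often wrong, which raises both quantities. Lower values indicate better calibration.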

Abstract

Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs). However, fine-tuned LLMs often become overconfident, especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs.
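
For readers unfamiliar with the technique, the generic form of a Laplace approximation is sketched below. The notation is an assumption made for illustration: $\theta$ stands for the LoRA parameters, $\theta_{\mathrm{MAP}}$ for their fine-tuned (maximum a posteriori) value, and $(\mathbf{X}, \mathbf{y})$ for the training inputs and targets. This excerpt does not state which Hessian approximation the paper uses, so the exact Hessian appears here only as the textbook definition.

```latex
\begin{align}
\mathcal{L}(\theta) &= \log P(\mathbf{y} \mid \mathbf{X}, \theta) + \log P(\theta), \\
\mathcal{L}(\theta) &\approx \mathcal{L}(\theta_{\mathrm{MAP}})
  - \tfrac{1}{2} (\theta - \theta_{\mathrm{MAP}})^{\top} \Sigma^{-1} (\theta - \theta_{\mathrm{MAP}}), \\
\Sigma^{-1} &= -\nabla_{\theta}^{2} \mathcal{L}(\theta) \big|_{\theta = \theta_{\mathrm{MAP}}}, \\
P(\theta \mid \mathbf{X}, \mathbf{y}) &\approx \mathcal{N}(\theta;\, \theta_{\mathrm{MAP}}, \Sigma).
\end{align}
```

The first-order term of the Taylor expansion vanishes because the gradient of the log-joint is zero at $\theta_{\mathrm{MAP}}$. The approximation can therefore be fitted after standard fine-tuning has finished, which is what makes the method post-hoc.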

Paper Structure

This paper contains 37 sections, 28 equations, 12 figures, 19 tables, 3 algorithms.

Figures (12)

  • Figure 1: Fine-tuning of LLaMA2-7B across six common-sense reasoning tasks (presented column-wise, with number of training examples in brackets), evaluated on the test set every 1000 gradient steps. The vertical dashed line gives the number of training steps with optimal MAP performance, and indicates that Laplace is likely to offer benefits even when combined with early stopping. MAP: standard fine-tuning; MC Dropout: Monte-Carlo dropout; Checkpoint Ensemble: ensembling the three most recent checkpoints; Ensemble: ensembling three LoRA fine-tuned models; LLLA: last-layer Laplace approximation on LoRA weights in the output layer; LA: full Laplace approximation on all LoRA weights.
  • Figure 2: Fine-tuning of LLaMA2-7B across six common-sense reasoning tasks, comparing different Laplace predictive posterior approximations: Laplace bridge approximation (bridge), generalized probit approximation (probit), Monte Carlo sampling using the diagonal covariance (MC indep), and Monte Carlo sampling using the full covariance (MC joint).
  • Figure 3: Fine-tuning of RoBERTa-base across six GLUE and SuperGLUE tasks (presented column-wise, with number of training examples in brackets), evaluated on the test set every 1000 gradient steps, without a validation set for hyperparameter tuning. The vertical dashed line gives the number of training steps with optimal MAP performance. Note that RoBERTa-base seems to fail on WNLI, but RoBERTa-large succeeds (Fig. 4).
  • Figure 4: Fine-tuning RoBERTa-large across the six GLUE and SuperGLUE tasks in Fig. 3, without a validation set for hyperparameter tuning. The vertical dashed line gives the number of training steps with optimal MAP performance.
  • Figure 5: Fine-tuning RoBERTa-base across the six GLUE and SuperGLUE tasks, with a validation set for hyperparameter tuning. The vertical dashed line gives the checkpoint with optimal MAP performance on the validation set.
  • ...and 7 more figures