LoRA ensembles for large language model fine-tuning

Xi Wang, Laurence Aitchison, Maja Rudolph

TL;DR

This work tackles the poor uncertainty quantification of fine-tuned LLMs and the impracticality of large model ensembles by proposing LoRA ensembles, which attach low-rank adapters to a shared base model to form scalable, memory-efficient ensembles. The method enables many ensemble components with minimal overhead and demonstrates improvements in both predictive accuracy and calibration across multiple QA tasks, including out-of-distribution scenarios. The authors further study how regularization (e.g., KL, early stopping, and large weight decay on the adapter matrices) interacts with ensembling, showing that LoRA ensembles are complementary to existing strategies and can yield robust uncertainty estimates. Overall, LoRA ensembles provide a practical avenue for reliable, scalable uncertainty quantification in large language models for real-world applications.

Abstract

Fine-tuned LLMs often exhibit poor uncertainty quantification, manifesting as overconfidence, poor calibration, and unreliable predictions on test data or out-of-distribution samples. One approach commonly used in vision to alleviate this issue is a deep ensemble, which is constructed by training the same model multiple times from different random initializations. However, ensembling LLMs poses a major challenge: the most effective LLMs are very large. Keeping a single LLM in memory is already difficult; keeping an ensemble of, e.g., 5 LLMs in memory is impossible in many settings. To address these issues, we propose an ensemble approach using Low-Rank Adapters (LoRA), a parameter-efficient fine-tuning technique. Critically, these low-rank adapters comprise a very small number of parameters, orders of magnitude fewer than the underlying pre-trained model. Thus, it is possible to construct large ensembles of LoRA adapters with almost the same computational overhead as using the original model. We find that LoRA ensembles, applied on their own or on top of pre-existing regularization techniques, give consistent improvements in predictive accuracy and uncertainty quantification.
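To make the idea in the abstract concrete, the following is a minimal sketch of an ensemble of low-rank adapters over one shared weight matrix, written in plain Python. It is not the authors' implementation: the toy dimensions, the random adapter values (standing in for independently fine-tuned adapters), the helper names, and the illustrative layer width of 4096 with rank 8 are all assumptions. Member predictions are combined by averaging their predictive distributions, as is standard for deep ensembles.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_adapter(d_out, d_in, rank, rng):
    """Hypothetical stand-in for a fine-tuned adapter (A: rank x d_in, B: d_out x rank).

    Standard LoRA initializes B to zero before training; random values are
    used here only so that the ensemble members differ.
    """
    A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]
    return A, B


def lora_forward(W0, adapter, x):
    """Compute (W0 + B A) x without ever materializing the full update B A."""
    A, B = adapter
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


def ensemble_predict(W0, adapters, x):
    """Average the predictive distributions of all adapters sharing W0."""
    probs = [softmax(lora_forward(W0, ad, x)) for ad in adapters]
    m = len(probs)
    return [sum(p[k] for p in probs) / m for k in range(len(probs[0]))]


rng = random.Random(0)
d_in, d_out, rank, M = 16, 4, 2, 5  # toy sizes, assumed for illustration
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # shared base
adapters = [make_adapter(d_out, d_in, rank, rng) for _ in range(M)]
x = [rng.gauss(0, 1) for _ in range(d_in)]
p = ensemble_predict(W0, adapters, x)

# Parameter arithmetic for one square layer at an assumed width and rank.
d, r = 4096, 8
base_params = d * d               # 16,777,216 weights in the shared matrix
adapter_params = r * (d + d)      # 65,536 weights per adapter
ensemble_extra = M * adapter_params  # 327,680 extra weights for 5 members
```

Under these assumed sizes, each adapter holds 1/256 of the parameters of the matrix it modifies, so five members add about 2% to that layer instead of the 400% that four additional full copies would cost. The base weights are stored once and reused by every member.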

Paper Structure (22 sections, 5 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: LoRA ensembles with strong weight decay regularization are more accurate and better calibrated than a single fine-tuned LoRA component on multiple-choice QA problems such as the one in Fig. \ref{fig:prompt_template}. Fig. \ref{fig:overconfidence_viz} shows a KDE of the confidence with which a pre-trained LLaMA-13b in the few-shot setting (purple line), a fine-tuned LoRA model (blue line), and our proposed LoRA ensembles (yellow dashed line) make wrong predictions on the cqa dataset. The few-shot approach is well calibrated but often wrong, while LoRA (M=1) is more accurate but overconfident in its wrong predictions. Our approach improves both accuracy and calibration, the latter measured by ECE.
  • Figure 2: LoRA ensembles improve both accuracy and calibration under different regularization techniques. Arrows link the performance of a single LoRA model (arrow tail) to the corresponding ensemble with 5 LoRA components (arrowhead), where the x-axis denotes validation accuracy and the y-axis expected calibration error (ECE). Arrow colors indicate regularization methods and opacity reflects regularization strength. The majority of arrows point toward the bottom-right corner, suggesting that ensembling improves both accuracy and calibration as measured by ECE.
  • Figure 3: LoRA ensembles improve accuracy while regularization prevents NLL from blowing up. For all ensemble results we use $M=5$ components. For weight decay, we use $\lambda=1\mathrm{e}3$ for mmlu subsets and $\lambda=1\mathrm{e}2$ for the other datasets.
  • Figure 4: Ensembles of LoRA models significantly outperform MC dropout with the same number of ensemble members. When dropout is used during fine-tuning, an alternative ensemble strategy becomes available: keeping dropout on at test time to implement Monte Carlo (MC) dropout. However, MC dropout offers only marginal performance gains over a standalone model and is outperformed by ensembles of independently trained LoRA models when both methods use the same number of ensemble members (5 in our experiments).
  • Figure 5: Ensembles offer benefits for accuracy and calibration over regularized and unregularized fine-tuning approaches in OOD settings. Note that all methods show AUROC around or below $0.5$ in the second and third rows. We suspect that the models fail to detect OOD samples when they can generalize to them, as accuracy increases throughout fine-tuning.
  • ...and 3 more figures
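
Several of the captions above report expected calibration error (ECE). As a rough illustration of what that metric measures, here is a minimal sketch of the common equal-width binned estimator in plain Python. The number of bins, the half-open binning convention, and the function name are assumptions; the paper's exact evaluation settings are not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct: booleans, True where the chosen answer was right.
    n_bins: number of equal-width bins (10 is an assumed default).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Bin i covers [i / n_bins, (i + 1) / n_bins); c == 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 95% confidence but only half of the answers right.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)

# A calibrated model: 75% confidence and three of four answers right.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
```

In the first toy case the gap between confidence and accuracy is 0.45, the kind of overconfidence attributed to a single fine-tuned model in the Figure 1 caption; in the second the gap is zero. Lower ECE means confidence tracks accuracy more closely, but it says nothing about accuracy itself, which is why the captions report both.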