LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks
Dominik J. Mühlematter, Michelle Halbheer, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu
TL;DR
This work tackles the challenge of producing well-calibrated uncertainty estimates in large transformer models without the prohibitive cost of training and deploying full ensembles. It introduces LoRA-Ensemble, a parameter-efficient implicit ensemble that freezes the pre-trained backbone and attaches low-rank updates to the attention projections. Following LoRA, a projection weight is written as $W = W_0 + \Delta W = W_0 + B A$; each ensemble member $i$ has its own update $\Delta W_i = B_i A_i$, so its projection of an input $x$ is $h_i = W_0 x + B_i A_i x$. By averaging predictions across the $N$ members and computing the ensemble variance, the method achieves accuracy and calibration that often surpass explicit ensembles and other implicit baselines, while using far fewer parameters and less memory. Experiments on CIFAR-100, HAM10000, iNaturalist, ESC-50, and SST-2 show strong predictive performance and superior calibration, with enhanced diversity among members in both function and weight space. The approach scales to large, fine-grained tasks and also transfers to CNNs, offering a practical and scalable route toward reliable uncertainty estimation in modern AI systems, with potential energy and environmental benefits.
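As an illustration of the formulation above, the following NumPy sketch computes the per-member projection $h_i = W_0 x + B_i A_i x$ for a single input vector, then averages softmax outputs across members and takes their variance. It is a minimal sketch, not the authors' implementation: the function names, the toy dimensions, the random initialization, and the choice to treat the projection outputs directly as class logits are all assumptions made for illustration.

```python
import numpy as np


def lora_ensemble_projection(x, W0, A, B):
    """Per-member projection h_i = W0 x + B_i A_i x.

    x:  (d_in,)          input vector
    W0: (d_out, d_in)    shared, frozen weight
    A:  (N, r, d_in)     per-member low-rank factors
    B:  (N, d_out, r)    per-member low-rank factors
    Returns an array of shape (N, d_out), one row per member.
    """
    shared = W0 @ x                                  # computed once, reused by all members
    low_rank = np.einsum("nor,nri,i->no", B, A, x)   # B_i (A_i x) for every member i
    return shared + low_rank


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_statistics(member_logits):
    """Mean and variance of class probabilities across members (axis 0)."""
    probs = softmax(member_logits)
    return probs.mean(axis=0), probs.var(axis=0)


# Toy sizes, chosen arbitrarily for the sketch (assumptions, not from the paper).
rng = np.random.default_rng(0)
N, d_in, d_out, r = 4, 16, 8, 2
x = rng.normal(size=d_in)
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(N, r, d_in))
B = 0.1 * rng.normal(size=(N, d_out, r))

h = lora_ensemble_projection(x, W0, A, B)            # (N, d_out)
mean_probs, var_probs = ensemble_statistics(h)       # each (d_out,)
```

In the method itself the low-rank updates sit inside the attention projections of a transformer; here a single matrix stands in for one such projection so that the shared term and the member-specific terms are easy to see.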
Abstract
Numerous real-world decisions rely on machine learning algorithms and require calibrated uncertainty estimates. However, modern methods often yield overconfident, uncalibrated predictions. The dominant approach to quantifying the uncertainty inherent in the model is to train an ensemble of separate predictors and measure their empirical variance. In an explicit implementation, the ensemble has high computational cost and memory footprint, especially if the base model itself is already large, like modern transformers. This motivates efforts to develop implicit ensemble methods that emulate the ensemble without explicitly instantiating all its members. We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. It is based on Low-Rank Adaptation (LoRA), originally developed for efficient LLM fine-tuning, and extends it into an implicit ensembling scheme, where all ensemble members share the same, pre-trained self-attention network, but have individual low-rank matrices for the attention projections. The resulting method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble, while at the same time achieving superior calibration.
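To make the memory argument concrete, the parameter counts of the two schemes can be compared for a single projection matrix. The sketch below is back-of-the-envelope arithmetic under assumed sizes (a 768×768 projection, rank 8, 16 members); these values are illustrative and are not figures reported by the paper, and the helper names are hypothetical.

```python
def explicit_ensemble_params(d_out, d_in, n_members):
    """Every member stores its own full d_out x d_in projection."""
    return n_members * d_out * d_in


def lora_ensemble_params(d_out, d_in, rank, n_members):
    """One shared projection plus per-member factors A_i (rank x d_in) and B_i (d_out x rank)."""
    shared = d_out * d_in
    per_member = rank * d_in + d_out * rank
    return shared + n_members * per_member


# Assumed sizes for illustration only.
d, r, n = 768, 8, 16
explicit = explicit_ensemble_params(d, d, n)
lora = lora_ensemble_params(d, d, r, n)
print(explicit, lora, explicit / lora)  # 9437184 786432 12.0
```

Under these assumptions the shared-backbone scheme needs one twelfth of the parameters for this one matrix. The gap widens as the number of members grows, because each extra member adds only the two small factors.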
