
LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks

Dominik J. Mühlematter, Michelle Halbheer, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu

TL;DR

This work tackles the challenge of producing well-calibrated uncertainty estimates in large transformer models without the prohibitive cost of training and deploying full ensembles. It introduces LoRA-Ensemble, a parameter-efficient implicit ensemble that freezes the pre-trained backbone and attaches low-rank updates to the attention projections, with each ensemble member $i$ defined by its own update $\Delta W_i = B_i A_i$. The formulation follows LoRA, $W = W_0 + \Delta W = W_0 + B A$, so that member $i$ computes $h_i = W_0 x + B_i A_i x$. By averaging predictions across the $N$ members and computing the ensemble variance, the method achieves accuracy and calibration that often surpass explicit ensembles and other implicit baselines, while dramatically reducing parameter count and memory footprint. Extensive experiments across CIFAR-100, HAM10000, iNaturalist, ESC-50, and SST-2 demonstrate strong predictive performance and superior calibration, with enhanced diversity in both function and weight spaces. The approach scales to large, fine-grained tasks and even transfers to CNNs, offering a practical and scalable route toward reliable uncertainty estimation in modern AI systems, with potential energy and environmental benefits.
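
The member-wise computation $h_i = W_0 x + B_i A_i x$, followed by averaging over members, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the shapes, the random initialisation, and the choice to treat the projection output directly as class logits are all assumptions made for the example (in the method itself the low-rank updates sit inside the attention projections of a full network).

```python
import numpy as np

def lora_ensemble_projection(x, W0, A, B):
    """Shared frozen weight plus one low-rank update per ensemble member.

    x:  (d_in,)          input vector
    W0: (d_out, d_in)    frozen pre-trained weight, shared by all members
    A:  (N, r, d_in)     per-member down-projections A_i
    B:  (N, d_out, r)    per-member up-projections B_i
    Returns h with shape (N, d_out), where h[i] = W0 @ x + B[i] @ A[i] @ x.
    """
    shared = W0 @ x  # computed once, reused by every member
    low_rank = np.einsum("nor,nri,i->no", B, A, x)  # B_i A_i x for each i
    return shared[None, :] + low_rank

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Mean and variance of the class probabilities across members."""
    probs = softmax(member_logits)  # (N, n_classes)
    return probs.mean(axis=0), probs.var(axis=0)

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out, rank, n_members = 8, 4, 2, 3
x = rng.normal(size=d_in)
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(n_members, rank, d_in))
B = rng.normal(size=(n_members, d_out, rank))

mean_probs, var_probs = ensemble_predict(lora_ensemble_projection(x, W0, A, B))
print(mean_probs, var_probs)
```

Because `W0 @ x` is shared, only the low-rank term differs between members; if every `B[i]` is zero, all members coincide and the ensemble variance vanishes.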

Abstract

Numerous real-world decisions rely on machine learning algorithms and require calibrated uncertainty estimates. However, modern methods often yield overconfident, uncalibrated predictions. The dominant approach to quantifying the uncertainty inherent in the model is to train an ensemble of separate predictors and measure their empirical variance. In an explicit implementation, the ensemble has high computational cost and memory footprint, especially if the base model itself is already large, like modern transformers. This motivates efforts to develop implicit ensemble methods that emulate the ensemble without explicitly instantiating all its members. We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. It is based on Low-Rank Adaptation (LoRA), originally developed for efficient LLM fine-tuning, and extends it into an implicit ensembling scheme, where all ensemble members share the same, pre-trained self-attention network, but have individual low-rank matrices for the attention projections. The resulting method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble, while at the same time achieving superior calibration.
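
To make the memory argument concrete, the following sketch compares parameter counts for an explicit ensemble and for an ensemble that shares one backbone and adds only low-rank matrices per member. All numbers (backbone size, depth, width, rank, number of adapted projections, ensemble size) are illustrative assumptions rather than values taken from the paper, and any member-specific classification head is ignored.

```python
def lora_params_per_member(n_layers, n_projections, d_in, d_out, rank):
    # Each adapted projection adds A (rank x d_in) and B (d_out x rank).
    return n_layers * n_projections * rank * (d_in + d_out)

def ensemble_param_counts(backbone_params, n_members, lora_per_member):
    # Explicit ensemble: every member stores a full copy of the backbone.
    explicit = n_members * backbone_params
    # Shared backbone: one copy plus the low-rank matrices of each member.
    lora_ensemble = backbone_params + n_members * lora_per_member
    return explicit, lora_ensemble

# Hypothetical configuration, assumed for illustration only.
per_member = lora_params_per_member(
    n_layers=12, n_projections=4, d_in=768, d_out=768, rank=8
)
explicit, lora = ensemble_param_counts(
    backbone_params=86_000_000, n_members=16, lora_per_member=per_member
)
print(per_member, explicit, lora, explicit / lora)
```

Under these assumptions the explicit ensemble grows linearly with the backbone size, whereas the shared-backbone variant grows only with the much smaller per-member low-rank term.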

Paper Structure (53 sections, 28 equations, 16 figures, 20 tables)


Figures (16)

  • Figure 1: A schema of a LoRA-Ensemble. The computation structure of the multi-head self-attention module (right) and the LoRA-Ensemble module (bottom left). $X$ denotes the actual input, and $x$ represents the intermediate input representation.
  • Figure 2: Function space analysis of LoRA-Ensemble vs. Explicit Ensemble.
  • Figure 3: Weight space analysis of LoRA-Ensemble vs. Explicit Ensemble.
  • Figure 4: Accuracy and ECE on CIFAR-100, with different ensemble sizes.
  • Figure 5: Reliability diagrams for Explicit Ensemble (left) and LoRA-Ensemble (right) with 16 members, on CIFAR-100.
  • ...and 11 more figures
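
The expected calibration error (ECE) reported in Figure 4 and the reliability diagrams in Figure 5 both rest on grouping predictions by confidence and comparing the average confidence with the accuracy in each group. A minimal sketch of the standard equal-width-bin ECE is given below; the function name, the number of bins, and the binning convention are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin fraction) * |accuracy - confidence|.

    probs:  (n_samples, n_classes) predicted class probabilities
    labels: (n_samples,) integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)  # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lower) & (confidence <= upper)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

The per-bin accuracy and confidence computed inside the loop are the same quantities a reliability diagram plots against each other; a perfectly calibrated model has zero gap in every bin.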