PVeRA: Probabilistic Vector-Based Random Matrix Adaptation

Leo Fillioux, Enzo Ferrante, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis

TL;DR

The paper tackles the challenge of efficiently adapting large foundation models under limited data and compute by introducing PVeRA, a probabilistic extension of VeRA that learns a distribution over low-rank adapters. It leverages reparameterization and KL regularization to enable sampling during training and inference, yielding uncertainty estimates and well-calibrated predictions. Empirically, PVeRA surpasses VeRA and other adapters on VTAB-1k while maintaining strong parameter efficiency and enabling inference-time merging of adapters. The approach also demonstrates uncertainty quantification, out-of-distribution detection, and preliminary NLP applicability, suggesting broad utility across vision and language tasks.

Abstract

Large foundation models have emerged in recent years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available at https://github.com/leofillioux/pvera.
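To make the idea of a probabilistic, VeRA-style adapter more concrete, the following is a minimal NumPy sketch of a single adapted linear layer that uses the reparameterization trick and a KL term toward a standard normal prior. It is an illustration under stated assumptions and not the authors' implementation: the separate projections `A_mu` and `A_logvar`, the scaling vectors `d_mu`, `d_logvar`, and `b`, the dimensions, and the function names are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 4  # hypothetical sizes for illustration

# Frozen parts: the pretrained weight and the random projections (VeRA-style).
W0 = rng.standard_normal((d_out, d_in))
A_mu = rng.standard_normal((r, d_in))      # assumption: projection feeding the mean
A_logvar = rng.standard_normal((r, d_in))  # assumption: projection feeding the log-variance
B = rng.standard_normal((d_out, r))

# Trainable scaling vectors: the only learned parameters in this sketch.
d_mu = np.full(r, 0.1)
d_logvar = np.full(r, 0.1)
b = np.zeros(d_out)  # zero initialization, so the adapter starts as a no-op


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)


def pvera_forward(x, sample=True):
    """Return the adapted output for one input vector and the KL regularizer."""
    mu = d_mu * (A_mu @ x)
    logvar = d_logvar * (A_logvar @ x)
    if sample:
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        eps = rng.standard_normal(r)
        z = mu + np.exp(0.5 * logvar) * eps
    else:
        # Deterministic variant: use the mean of the latent distribution.
        z = mu
    delta = b * (B @ z)
    return W0 @ x + delta, kl_to_standard_normal(mu, logvar)


x = rng.standard_normal(d_in)
y, kl = pvera_forward(x)
```

In a training loop, the KL value would typically be added to the task loss with a small weight. Passing `sample=False` shows one way the training and testing configurations could differ.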

Paper Structure

This paper contains 20 sections, 11 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Probabilistic Vector-Based Random Matrix Adaptation. (a) PVeRA learns a distribution of latent adaptations, from which samples are drawn to compute the adaptation. (b) We showcase how a model adapted with PVeRA can be used to estimate confidence intervals for the prediction.
  • Figure 2: Representation of the VeRA and PVeRA architectures. (a) VeRA (Kopiczko et al., 2024) on one Transformer encoder layer. (b) Our proposed PVeRA: a probabilistic variation of VeRA applied to the query and value components of the multi-head attention mechanism of the Transformer encoder layer. Pseudocode for PVeRA is shown in the Appendix.
  • Figure 3: Comparison of the computational efficiency. (a) Number of trainable parameters of the adapters against the accuracy. (b) Number of parameters of the whole model adapted with each adapter against the accuracy. (c) FLOPS of a single adapter against the accuracy. (d) FLOPS of a whole model adapted with each adapter against the accuracy. Note that for the adapters for which a grid search over the hyperparameters is performed, the number of parameters and the FLOPS are averages weighted by the proportion of each chosen hyperparameter (see the grid search section of the Appendix).
  • Figure 4: Average calibration performance of adapters. Average ACE across all datasets for all considered adapters. Lower is better.
  • Figure 5: Uncertainty estimation visualization. Distribution of standard deviation of the softmax scores for correctly and incorrectly classified samples when using (a) 4 samples and (b) 16 samples. Results across all datasets. The significance levels correspond to p-values for a one-sided unpaired Wilcoxon test, and indicate distributions with significantly different values.
  • ...and 4 more figures
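
The uncertainty measure described in the Figure 5 caption, the standard deviation of the softmax scores across several sampled forward passes, can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the function names are hypothetical, and measuring the standard deviation on the predicted class only is an assumption of this example.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predictive_uncertainty(sampled_logits):
    """Summarize several stochastic forward passes for one input.

    sampled_logits has shape (n_samples, n_classes), one row per sampled pass.
    Returns the predicted class, the mean softmax scores, and the standard
    deviation of the predicted class score across samples (an assumption).
    """
    probs = softmax(sampled_logits)
    mean_probs = probs.mean(axis=0)
    pred = int(mean_probs.argmax())
    std = float(probs[:, pred].std())
    return pred, mean_probs, std


# Four samples that agree with each other, and four that disagree.
stable = np.array([[2.0, 0.0, 0.0]] * 4)
unstable = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                     [2.0, 0.5, 0.0], [1.0, 1.5, 0.0]])

pred_s, mean_s, std_s = predictive_uncertainty(stable)
pred_u, mean_u, std_u = predictive_uncertainty(unstable)
```

Under this sketch, samples that agree give a standard deviation near zero, while samples that disagree give a larger value. That larger value could be used to flag a prediction as uncertain.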