Time-, Memory- and Parameter-Efficient Visual Adaptation

Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

TL;DR

We address the inefficiency of existing parameter-efficient fine-tuning methods by freezing the backbone and training a parallel, low-rank side adaptor (LoSA) that refines backbone features without gradient backpropagation through the backbone. The adaptor employs a low-rank mixer with channel- and token-mixing to model interactions, and extends to video by factorizing the token dimension into spatial and temporal axes. Empirically, LoSA achieves state-of-the-art accuracy-parameter trade-offs on VTAB, scales to large backbones for video (e.g., ViViT-e, 4B parameters) with reduced training time and memory, and outperforms prior adaptor-based or full-finetuning baselines under the same compute budget. The work demonstrates that training-time and memory efficiency can go hand-in-hand with strong accuracy, offering a practical path for adapting large foundation models to diverse downstream tasks.

Abstract

As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They typically still require backpropagating gradients through the entire model, so their training time and memory costs do not fall as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network that runs in parallel and operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show that it outperforms prior works in training time and memory usage. We also demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform both a prior adaptor-based method, which could only scale to a 1 billion parameter backbone, and full finetuning of a smaller backbone, using the same GPU and less training time.
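
To make the parameter side of this argument concrete, the following minimal sketch counts the weights in a dense mixing layer versus a low-rank factorisation of the same layer. It is an illustration only: the function names, the rank of 8, and the 1024-channel, 196-token shape are assumptions chosen for the example rather than values taken from the paper, and biases are ignored.

```python
def dense_params(dim: int) -> int:
    """Weights in a full dim x dim linear map (bias ignored)."""
    return dim * dim


def low_rank_params(dim: int, rank: int) -> int:
    """Weights in a down-projection (dim x rank) plus an up-projection (rank x dim)."""
    return 2 * dim * rank


def mixer_params(channels: int, tokens: int, rank: int) -> dict:
    """Hypothetical low-rank mixer: one low-rank map over channels, one over tokens."""
    low = low_rank_params(channels, rank) + low_rank_params(tokens, rank)
    full = dense_params(channels) + dense_params(tokens)
    return {"low_rank": low, "dense": full, "ratio": low / full}


# Assumed shapes, for illustration only (not taken from the paper).
stats = mixer_params(channels=1024, tokens=196, rank=8)
print(stats)
```

Parameter count alone does not capture the abstract's main point: a low-rank module placed inside the backbone still needs gradients to flow through the frozen layers, whereas a side network that only reads backbone features does not.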

Paper Structure (32 sections, 6 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Parameter-efficient adaptation methods proposed in the literature are not necessarily efficient with respect to other metrics. Prompt-tuning [jia2022visual], for example, learns only a few prompt tokens, but is in fact slower than fully finetuning the network because of the added tokens. LoRA [hu2021lora] and BitFit [zaken2021bitfit] do not substantially reduce training time, as they still need to backpropagate through the whole network. Our method, Low-Rank Side Adaptation (LoSA), in contrast, achieves improvements across multiple efficiency metrics and tasks. Experiments were conducted by adapting ViT-g with 1 billion parameters.
  • Figure 2: Computational graphs of Self-Attention (SA) for the backward pass in the backbone network when performing (a) full finetuning, which requires caching or recomputing gradients with respect to large activation tensors (red ovals) and is therefore memory- and compute-intensive; (b) LoRA [hu2021lora]; and (c) our proposed LoSA. Although LoRA (b) trains only a small number of parameters per SA block, namely $A_q$ and $B_q$, it still requires backpropagating gradients throughout the entire backbone, so its computational graph is quite similar to that of full finetuning ($\otimes$ denotes the multiplication of two low-rank matrices). Our method (c), in contrast, freezes the backbone completely and does not need to backpropagate through it at all, which results in significant reductions in training time and memory.
  • Figure 3: Overview of our approach. Top: We learn a parallel side network which iteratively refines the features obtained from a frozen backbone, $B$. Bottom: Our adaptation function consists of low-rank mixer modules, which allow for achieving high accuracy on a wide range of downstream tasks without sacrificing efficiency.
  • Figure 4: Comparison of trade-offs of accuracy with respect to learned parameters, training memory, inference GFLOPs and training speed. Our approach, LoSA, is consistently on the Pareto frontier (denoted by shaded yellow circles), as there is no method that is both more accurate and more efficient than it, across multiple efficiency metrics. Results are on the iNaturalist 2018 dataset, using a ViT-g backbone with 1 billion frozen parameters.
  • Figure 5: Comparison of trade-offs of accuracy with respect to learned parameters, training memory, inference GFLOPs and training speed. Our approach, LoSA, is consistently on the Pareto frontier (denoted by shaded yellow circles), as there is no method that is both more accurate and more efficient than it, across multiple efficiency metrics. Results are on the iNaturalist 2021 dataset, using a ViT-g backbone with 1 billion frozen parameters.
  • ...and 1 more figure
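
The Pareto-frontier criterion in the captions of Figures 4 and 5 (no other method is both more accurate and more efficient) can be sketched as a small filter. The version below uses the standard weak-dominance definition on one accuracy axis and one cost axis. The `Method` type, the function names, and every number are placeholders invented for illustration; none of them are results from the paper.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better (e.g. training memory)


def dominates(a: Method, b: Method) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost))


def pareto_frontier(methods: list[Method]) -> list[Method]:
    """Keep every method that no other method dominates."""
    return [m for m in methods
            if not any(dominates(other, m) for other in methods)]


# Placeholder numbers, invented for illustration; not results from the paper.
candidates = [
    Method("method_a", accuracy=70.0, cost=10.0),
    Method("method_b", accuracy=72.0, cost=30.0),
    Method("method_c", accuracy=69.0, cost=20.0),  # dominated by method_a
    Method("method_d", accuracy=75.0, cost=60.0),
]
frontier = [m.name for m in pareto_frontier(candidates)]
print(frontier)
```

Repeating the filter with a different cost axis (learned parameters, inference GFLOPs, training time) gives one frontier per panel, which is how a method can be checked for being on the frontier across several efficiency metrics.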