Time-, Memory- and Parameter-Efficient Visual Adaptation
Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab
TL;DR
We address the training-time and memory inefficiency of existing parameter-efficient finetuning methods by freezing the backbone and training a parallel, low-rank side adaptor (LoSA) that refines backbone features without backpropagating gradients through the backbone. The adaptor uses a low-rank mixer with channel-mixing and token-mixing components to model interactions, and extends to video by factorizing the token dimension into spatial and temporal axes. Empirically, LoSA achieves state-of-the-art accuracy-parameter trade-offs on VTAB and scales to large video backbones (e.g., ViViT-e, 4B parameters) with reduced training time and memory. On video classification, it outperforms both a prior adaptor-based method and full finetuning of a smaller backbone, using the same GPU and less training time. The work demonstrates that training-time and memory efficiency can go hand in hand with strong accuracy, offering a practical path for adapting large foundation models to diverse downstream tasks.
Abstract
As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. However, they typically still require backpropagating gradients throughout the model, meaning that their training time and memory costs are not reduced as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network, run in parallel, that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we show that it also outperforms prior works with respect to training time and memory usage. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform both a prior adaptor-based method, which could only scale to a 1 billion parameter backbone, and full finetuning of a smaller backbone, using the same GPU and less training time.
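To make the idea of a lightweight parallel network concrete, the following is a minimal NumPy sketch of one low-rank mixer block operating on precomputed backbone features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`init_mixer`, `low_rank_mixer`, `count_params`), the ReLU nonlinearity, the residual connections, the zero initialization of the up-projections, and the sizes (196 tokens, 768 channels, rank 16) are all hypothetical choices. The backbone features are plain arrays here, which mirrors the frozen-backbone setting; in an autodiff framework one would additionally block gradients at this point (for example with `jax.lax.stop_gradient`).

```python
import numpy as np


def init_mixer(tokens, channels, rank, seed=0):
    """Create low-rank factors for one mixer block (hypothetical layout).

    Each mixing step is factored as dim -> rank -> dim. The up-projections
    start at zero, so the block initially leaves its input unchanged.
    """
    rng = np.random.default_rng(seed)
    return {
        "channel": (rng.normal(0.0, 0.02, size=(channels, rank)),
                    np.zeros((rank, channels))),
        "token": (rng.normal(0.0, 0.02, size=(tokens, rank)),
                  np.zeros((rank, tokens))),
    }


def low_rank_mixer(features, params):
    """Apply channel mixing, then token mixing, to (tokens, channels) features.

    `features` stands in for the output of a frozen backbone layer; it is
    treated as a constant input and is never updated.
    """
    c_down, c_up = params["channel"]
    t_down, t_up = params["token"]
    # Channel mixing: every token's channel vector passes through a rank-r bottleneck.
    x = features + np.maximum(features @ c_down, 0.0) @ c_up
    # Token mixing: transpose so the same bottleneck pattern acts along the token axis.
    x = x + (np.maximum(x.T @ t_down, 0.0) @ t_up).T
    return x


def count_params(params):
    """Number of trainable scalars in the adaptor block."""
    return sum(down.size + up.size for down, up in params.values())
```

With the assumed sizes, the block holds 2 × 768 × 16 + 2 × 196 × 16 = 30,848 trainable values, compared with 768² + 196² = 628,240 for unfactored channel and token mixing matrices. Because the features enter only as inputs, a training loop built on this sketch would need gradients only for the adaptor factors, which is the property the abstract attributes to the method.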
