ColA: Collaborative Adaptation with Gradient Learning
Enmao Diao, Qi Le, Suya Wu, Xinran Wang, Ali Anwar, Jie Ding, Vahid Tarokh
TL;DR
ColA with Gradient Learning (GL) tackles the computational bottleneck of fine-tuning large pretrained models by decoupling the gradient computation of hidden representations from that of adapter parameters and offloading the latter to low-cost devices. The authors show that GL is theoretically equivalent to classical gradient descent, introduce parameter merging to reduce on-device memory, and report, across sequence classification, sequence-to-sequence, and causal language modeling benchmarks, that ColA can match or outperform PEFT baselines while easing the computation space (memory) bottleneck. The FTaaS-oriented design lets multiple users collaboratively fine-tune adapters without overloading the central GPU, offering a scalable path to personalized, deployable foundation models. Overall, ColA advances efficient, model-agnostic fine-tuning by combining functional gradient descent principles, gradient offloading, and collaborative adapters for practical, large-scale applications.
Abstract
A primary function of back-propagation is to compute the gradients of both hidden representations and parameters for optimization with gradient descent. Training large models incurs high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods train smaller auxiliary models to save computational space, they still incur computational overhead, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradients of hidden representations and parameters. In comparison to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of the parameter gradients to low-cost devices. We also provide a theoretical analysis of ColA and experimentally demonstrate that ColA can perform on par with or better than existing PEFT methods on various benchmarks.
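The decoupling described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single frozen linear layer, a linear adapter, and a squared-error loss, and every name in it (`W`, `A`, `matvec`, `outer`, and so on) is hypothetical. The "server" computes only the gradient with respect to the hidden representation. The "low-cost device" then fits the adapter to a fixed target built from that gradient. Under these assumptions, the gradient of the fitting loss with respect to the adapter weights coincides with the one back-propagation would give.

```python
# Toy illustration (assumed setup, not the paper's code): frozen layer W,
# linear adapter A, loss L = 0.5 * ||h_hat - y||^2 with h_hat = W x + A x.

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]

W = [[0.5, -1.0], [2.0, 0.25]]   # frozen pretrained weights
A = [[0.1, 0.0], [-0.2, 0.3]]    # adapter weights (trainable)
x = [1.0, 2.0]                   # input to the layer and the adapter
y = [0.0, 1.0]                   # training target

# Server side: forward pass, then the gradient w.r.t. the hidden
# representation only. No adapter-parameter gradient is computed here.
delta_h = matvec(A, x)                                   # adapter output
h_hat = [h + d for h, d in zip(matvec(W, x), delta_h)]   # adapted hidden state
grad_h = [hh - yy for hh, yy in zip(h_hat, y)]           # dL/dh_hat

# Reference: classical back-propagation applies the chain rule through
# the adapter, giving dL/dA = grad_h x^T.
grad_A_classic = outer(grad_h, x)

# Low-cost device: receives (x, delta_h, grad_h), treats delta_h - grad_h
# as a fixed regression target, and differentiates 0.5 * ||A x - target||^2.
target = [d - g for d, g in zip(delta_h, grad_h)]
residual = [p - t for p, t in zip(matvec(A, x), target)]
grad_A_offloaded = outer(residual, x)

# Largest entrywise difference between the two parameter gradients.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(grad_A_classic, grad_A_offloaded)
    for a, b in zip(row_a, row_b)
)
```

In this toy case `max_diff` is zero up to floating-point rounding, because the residual `A x - target` reduces to `grad_h`. The offloaded step needs only the adapter input, the adapter output, and the hidden-representation gradient, which is what makes it possible to run it away from the device that holds the large model.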
