ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale
Markus Frohmann, Carolin Holtermann, Shahed Masoudian, Anne Lauscher, Navid Rekabsaz
TL;DR
ScaLearn tackles the efficiency bottleneck of two-stage MTL by introducing a scaling-based transfer layer that reuses frozen source adapters. Its variants (ScaLearn, ScaLearnUniform, ScaLearn++, and ScaLearnUniform++) achieve competitive or superior transfer performance with a small fraction of AdapterFusion's transfer parameters (about 0.35%), as demonstrated on GLUE, SuperGLUE, and HumSet across RoBERTa and XLM-R backbones. The method applies simple, differentiable scaling to adapter outputs, without forcing the scaling coefficients to form a probability distribution, and retains strong few-shot transfer capabilities. Overall, ScaLearn shows that minimal, well-structured scaling of modular knowledge can unlock efficient, reusable cross-task transfer in NLP, aligning with GreenAI goals while maintaining high performance.
Abstract
Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning $n$ tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL introduces a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (about $0.35$% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only $8$ transfer parameters per target task. Our proposed approach thus demonstrates the promise of simple scaling for more efficient task transfer.
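The core idea of linearly scaling source adapter outputs can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scale_and_combine`, the toy vectors, and the choice of one scalar per source adapter are assumptions made for illustration, and the gradient-based learning of the coefficients is omitted.

```python
# Hypothetical sketch: combine the outputs of frozen source adapters
# using one scaling coefficient per source task (an assumed setup).

def scale_and_combine(adapter_outputs, scales):
    """Return sum_t scales[t] * adapter_outputs[t] over the hidden dimension."""
    if len(adapter_outputs) != len(scales):
        raise ValueError("need one scaling coefficient per source adapter")
    hidden_size = len(adapter_outputs[0])
    combined = [0.0] * hidden_size
    for output, scale in zip(adapter_outputs, scales):
        for i, value in enumerate(output):
            combined[i] += scale * value
    return combined

# Toy outputs of three frozen source adapters for a single token (hidden size 2).
source_outputs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# The coefficients are unconstrained: they need not be positive or sum to one.
scales = [0.5, -0.25, 1.5]

combined = scale_and_combine(source_outputs, scales)
print(combined)
```

In this sketch the only transfer parameters are the entries of `scales`, so their number grows with the number of source tasks rather than with the hidden size, which is the intuition behind the small parameter budget described above.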
