ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale

Markus Frohmann; Carolin Holtermann; Shahed Masoudian; Anne Lauscher; Navid Rekabsaz

TL;DR

ScaLearn tackles the efficiency bottleneck of two-stage multi-task learning (MTL) by introducing a scaling-based transfer layer that reuses frozen source adapters. Its variants (ScaLearn, ScaLearnUniform, ScaLearn++, and ScaLearnUniform++) achieve competitive or superior transfer performance with far fewer transfer parameters than AdapterFusion, as demonstrated on GLUE, SuperGLUE, and HumSet with RoBERTa and XLM-R backbones. The method applies simple, differentiable scaling to adapter outputs without constraining the scaling coefficients to form a probability distribution, and it retains strong few-shot transfer capabilities. Overall, ScaLearn shows that minimal, well-structured scaling of modular knowledge enables efficient, reusable cross-task transfer in NLP, in line with GreenAI goals, while maintaining high performance.
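As a rough illustration of the scaling idea, the following NumPy sketch combines the outputs of several frozen source adapters by multiplying each with a coefficient and summing the results. The function name `scale_and_combine`, the array shapes, and the toy values are assumptions made for this example and are not taken from the paper's code. The coefficients are left unconstrained, so they may be negative and need not sum to one.

```python
import numpy as np

def scale_and_combine(adapter_outputs, omega):
    """Scale each source adapter's output and sum over sources.

    adapter_outputs: array of shape (num_sources, seq_len, hidden)
    omega: scaling coefficients, shape (num_sources, hidden) for a
           per-dimension scaling vector per source, or (num_sources,)
           for a single scalar per source (assumed "uniform" case).
    Returns an array of shape (seq_len, hidden).
    """
    adapter_outputs = np.asarray(adapter_outputs, dtype=float)
    omega = np.asarray(omega, dtype=float)
    if omega.ndim == 1:
        # One scalar per source: broadcast it over the hidden dimension.
        omega = omega[:, None]
    # Broadcast over the sequence dimension, then sum over sources.
    return (omega[:, None, :] * adapter_outputs).sum(axis=0)

# Toy example: two source adapters, sequence length 3, hidden size 4.
outputs = np.stack([np.ones((3, 4)), 2.0 * np.ones((3, 4))])

# Unconstrained scalars: they do not sum to 1 and one is negative.
omega_uniform = np.array([1.5, -0.5])
combined = scale_and_combine(outputs, omega_uniform)

# Per-dimension scaling vectors, one per source adapter.
omega_vector = np.array([[1.0, 0.0, 1.0, 0.0],
                         [0.0, 1.0, 0.0, 1.0]])
combined_vec = scale_and_combine(outputs, omega_vector)
```

In a trainable setting, `omega` would be the only learned quantity in the transfer layer while the adapter outputs come from frozen modules; an attention-style alternative would instead pass the weights through a softmax before combining.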

Abstract

Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning $n$ tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL introduces a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~ $0.35$% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only $8$ transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer.
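The parameter figures quoted in the abstract can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below rests on several assumptions that are not stated in this summary: 8 source tasks, a hidden size of 768 and 12 layers (a BASE-sized encoder), an AdapterFusion layer modelled as three hidden-by-hidden matrices (query, key, value) per layer with biases ignored, and variant definitions in which the "Uniform" variants use one scalar per source adapter and the "++" variants share parameters across layers. The function names are hypothetical.

```python
def transfer_param_counts(num_sources, hidden, num_layers):
    """Transfer-parameter counts under the assumed variant definitions."""
    return {
        # One scaling vector per source adapter and layer.
        "ScaLearn": num_sources * hidden * num_layers,
        # One scalar per source adapter and layer.
        "ScaLearnUniform": num_sources * num_layers,
        # One scaling vector per source adapter, shared across layers.
        "ScaLearn++": num_sources * hidden,
        # One scalar per source adapter, shared across layers.
        "ScaLearnUniform++": num_sources,
    }

def adapterfusion_param_count(hidden, num_layers):
    """Assumed fusion layer: query, key, value matrices; biases ignored."""
    return 3 * hidden * hidden * num_layers

counts = transfer_param_counts(num_sources=8, hidden=768, num_layers=12)
fusion = adapterfusion_param_count(hidden=768, num_layers=12)
ratio = counts["ScaLearn"] / fusion

for name, n in counts.items():
    print(f"{name}: {n} transfer parameters")
print(f"AdapterFusion (assumed): {fusion} transfer parameters")
print(f"ScaLearn / AdapterFusion: {100 * ratio:.2f}%")
```

Under these assumptions the ratio comes out at roughly 0.35%, and the most compact variant needs 8 parameters, which is consistent with the numbers given in the abstract; the exact counts in the paper may differ depending on how biases and the number of source tasks are handled.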

Paper Structure (18 sections, 3 equations, 10 figures, 21 tables)

This paper contains 18 sections, 3 equations, 10 figures, 21 tables.

Introduction
Contributions.
Background
ScaLearn -- Learning to Scale for Knowledge Transfer
Experiment Setup
Results
Parameter-efficiency analysis
Transfer Learning Performance
Few-shot Transfer Learning
Related Work
Conclusion
Appendix
Complete Experiment Details
Analysis on Scaling Output Representations
Ablation Study
...and 3 more sections

Figures (10)

Figure 1: Performance and parameter-efficiency of single task learning (STL) and joint/two-stage MTL methods, evaluated on GLUE and SuperGLUE using $\text{RoBERTa}_{\text{BASE}}$. For the two-stage MTL methods, the reported parameter counts cover only the respective transfer layers. The full details of the learnable parameters and performance results are provided in the Results section.
Figure 2: Few-shot transfer learning results with $k \in \{4, 16, 32, 100\}$ training samples for each target task using the BASE models of RoBERTa and XLM-R. Full results over several runs are provided in the Appendix.
Figure 3: Probing results of 4 target tasks in various transfer learning conditions. (Top) Effect of scaling the output representations of adapters by weight $\omega_s$ using different source adapters. (Bottom) Effect of combining independently scaled output representations of two adapters trained on the target task and MNLI, respectively. Each point shows the mean over 5 seeds.
Figure 4: Few-shot learning results ($k \in \{4, 16, 32, 100\}$) comparing Adapter, AdapterFusion, and ScaLearn using $\text{RoBERTa}_{\text{LARGE}}$ on three benchmarks. We show the mean across 5 seeds. For AdapterFusion and ScaLearn, we assume the availability of a Pfeiffer adapter trained on $k$ samples of the target task, plus Pfeiffer adapters trained on all samples of each of the other tasks.
Figure 5: Effect of scaling the output representations $\mathbf{o}^{l}_{s}$ of adapters by weight $\omega_s$ using different source adapters from all other tasks from GLUE and SuperGLUE. Each point shows the mean over 5 seeds.
...and 5 more figures
