Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation

Mateusz Klimaszewski; Piotr Andruszkiewicz; Alexandra Birch

TL;DR

The paper investigates whether PEFT modules can be transferred between different PLMs. Framing the problem through Knowledge Distillation, the authors propose initialising a student's PEFT modules with a teacher's pre-trained, task-specific modules when the two models match, and a pruning-and-alignment pipeline when they are incompatible. In the pruning step, they sample embeddings, compute a Pearson correlation matrix $C$, solve a linear sum assignment problem to obtain a binary mapping $Z$, and prune unused weights in the down/up projection matrices $W$, aligning the latent spaces without changing inference complexity. Experiments on multilingual NER, NLI, and Paraphrase Identification with Adapter and LoRA indicate that transferable modularity is feasible for matching PLMs: SKIP transfers give consistent gains over the baseline and approach teacher performance. Results for incompatible models are inconsistent and point to alignment challenges. The work suggests practical potential for reusing pre-trained modules across models and identifies directions for making modular transfer more robust.
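
For the matching setting, the transfer amounts to initialising the student's PEFT modules from the teacher's before fine-tuning. The sketch below illustrates that idea with plain NumPy arrays. The function name `init_student_modules`, the dictionary layout with `"down"` and `"up"` keys, and the layer-selection rule (take every `stride`-th teacher layer when the student is shallower) are all assumptions made for illustration; the summary does not spell out how SKIP chooses layers, so this should not be read as the authors' exact procedure.

```python
import numpy as np


def init_student_modules(teacher_modules, num_student_layers):
    """Copy per-layer PEFT weights from a teacher into a shallower student.

    teacher_modules: list with one dict per teacher layer, e.g.
        {"down": array, "up": array} (hypothetical layout).
    Assumes teacher and student share the same hidden size, so the
    weights can be copied without reshaping.
    """
    num_teacher_layers = len(teacher_modules)
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")

    # Assumed selection rule: keep every `stride`-th teacher layer.
    stride = num_teacher_layers // num_student_layers
    student_modules = []
    for i in range(num_student_layers):
        source = teacher_modules[i * stride]
        # Copy the arrays so fine-tuning the student leaves the teacher intact.
        student_modules.append({name: w.copy() for name, w in source.items()})
    return student_modules
```

After this initialisation, the student would be fine-tuned on the task as usual; only the starting point of the PEFT weights changes.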

Abstract

The rise of Modular Deep Learning showcases its potential in various Natural Language Processing applications. Parameter-efficient fine-tuning (PEFT) modularity has been shown to work for various use cases, from domain adaptation to multilingual setups. However, all this work covers the case where the modular components are trained and deployed within one single Pre-trained Language Model (PLM). This model-specific setup is a substantial limitation on the very modularity that modular architectures are trying to achieve. We ask whether current modular approaches are transferable between models and whether we can transfer the modules from more robust and larger PLMs to smaller ones. In this work, we aim to fill this gap via a lens of Knowledge Distillation, commonly used for model compression, and present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs. Moreover, we propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity. The experiments on Named Entity Recognition, Natural Language Inference, and Paraphrase Identification tasks over multiple languages and PEFT methods showcase the initial potential of transferable modularity.

Paper Structure (15 sections, 1 equation, 3 figures, 6 tables)

Introduction
Transferable Modularity
Pruning and Alignment
Experiments
Datasets
Training Setup
Baselines and Metrics
Results and Discussion
Matching Models
Incompatible Models
Conclusions
Acknowledgements
Bibliographical References
Experimental Setup
Per Language Results

Figures (3)

Figure 1: The most straightforward case of transferable modularity. The teacher model is first trained on a task using PEFT, e.g. Adapters, and then the student PEFT modules, prior to fine-tuning, are initialised with the teacher weights.
Figure 2: The schema of the transferable modularity experiment. We investigate setups where the teacher-student pair results from task-agnostic distillation or consists of independently trained models.
Figure 3: Toy example of adapting the PEFT modules in the case of mismatched dimensionality. Based on the sampled embeddings (1.), a correlation matrix $C$ is calculated (2.) and reduced via linear sum assignment (LSA) to a binary matrix $Z$ (3.). In the last step (4.), the pruning and alignment mapping function (derived from $Z$) is applied to the down/up projection matrices of the LoRA/Adapter modules so that their dimensions match.
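
The four steps in Figure 3 can be sketched roughly as follows, using NumPy and SciPy's `linear_sum_assignment`. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `alignment_indices` and `prune_projections` are hypothetical, the embeddings are assumed to be sampled for the same tokens from both models, the raw (signed) Pearson correlation is maximised, and the down/up projections are assumed to have shapes `(hidden, bottleneck)` and `(bottleneck, hidden)`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def alignment_indices(teacher_emb, student_emb):
    """Return, for each student dimension, the matched teacher dimension.

    teacher_emb: (n_samples, d_teacher), student_emb: (n_samples, d_student),
    with d_student <= d_teacher (assumption of this sketch).
    """
    d_t = teacher_emb.shape[1]
    d_s = student_emb.shape[1]
    # Step 2: Pearson correlation between every teacher and student dimension.
    stacked = np.hstack([teacher_emb, student_emb])
    full = np.corrcoef(stacked, rowvar=False)
    C = full[:d_t, d_t:]  # shape (d_teacher, d_student)
    # Step 3: linear sum assignment, maximising the total correlation.
    rows, cols = linear_sum_assignment(C, maximize=True)
    # The (rows, cols) pairs are the non-zero entries of the binary matrix Z.
    keep = np.empty(d_s, dtype=int)
    keep[cols] = rows
    return keep


def prune_projections(w_down, w_up, keep):
    """Step 4: drop unmatched hidden dimensions and reorder the rest.

    w_down: (d_teacher, bottleneck), w_up: (bottleneck, d_teacher).
    """
    return w_down[keep, :], w_up[:, keep]
```

Because the pruned matrices already have the student's hidden size, the student runs with modules of its own dimensionality and no extra mapping layer is needed at inference time.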
