Utilization of Pre-trained Language Model for Adapter-based Knowledge Transfer in Software Engineering

Iman Saberi; Fatemeh Fard; Fuxiang Chen

TL;DR

This work tackles the challenge of efficiently transferring knowledge from pre-trained language models to software engineering tasks. It introduces MODE-X, a cross-modal adapter framework that injects adapters trained on code into NL-PLMs to perform code-related tasks, and evaluates them on cloze tests, code clone detection, and code summarization, comparing against strong C-PLMs. The study further shows that adapters embedded in C-PLMs can improve performance on several SE tasks while remaining significantly more parameter-efficient than full fine-tuning, with probing and attention analyses elucidating how adapters reorganize representations toward code semantics. The findings suggest adapters enable scalable, resource-efficient knowledge transfer for SE, with practical implications for integrating such models into real-world development tools and IDEs. Overall, the paper demonstrates that adapters can bridge modalities and languages in SE while reducing training and storage demands, opening pathways for broader adoption and multilanguage support.

Abstract

Software Engineering (SE) Pre-trained Language Models (PLMs), such as CodeBERT, are pre-trained on large code corpora, and their learned knowledge has been successfully transferred to downstream tasks (e.g., code clone detection) through fine-tuning. In Natural Language Processing (NLP), an alternative way of transferring the knowledge of PLMs has been explored through adapters, compact and parameter-efficient modules that are inserted into a PLM. Although adapters have shown promising results in many NLP downstream tasks, their application to SE downstream tasks remains limited. Here, we study knowledge transfer using adapters on multiple downstream tasks, including cloze test, code clone detection, and code summarization. These adapters are trained on code corpora and are inserted into a PLM that is pre-trained on either English corpora or code corpora. We call these PLMs NL-PLM and C-PLM, respectively. We observed an improvement in results using an NL-PLM with adapters over a PLM that does not have adapters, which suggests that adapters can transfer and utilize useful knowledge from an NL-PLM to SE tasks. The results are sometimes on par with or exceed those of a C-PLM, while being more efficient in terms of the number of parameters and training time. Interestingly, adapters inserted into a C-PLM generally yield better results than a traditionally fine-tuned C-PLM. Our results open new directions for building more compact models for SE tasks.
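To make the idea of an adapter as a compact module inserted into a PLM more concrete, the following is a minimal sketch of a generic bottleneck adapter (down-projection, non-linearity, up-projection, residual connection) in plain Python. It is an illustration only, not the authors' implementation: the function names, the ReLU activation, the hidden size of 768, and the bottleneck size of 48 are all assumptions chosen for the example.

```python
import random


def make_adapter(hidden_size, bottleneck_size, seed=0):
    """Create weights for one hypothetical bottleneck adapter."""
    rng = random.Random(seed)
    scale = 0.01  # small initial weights keep the adapter close to identity
    return {
        "w_down": [[rng.gauss(0.0, scale) for _ in range(bottleneck_size)]
                   for _ in range(hidden_size)],
        "b_down": [0.0] * bottleneck_size,
        "w_up": [[rng.gauss(0.0, scale) for _ in range(hidden_size)]
                 for _ in range(bottleneck_size)],
        "b_up": [0.0] * hidden_size,
    }


def linear(x, weight, bias):
    """Compute x @ weight + bias for a single vector x."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def adapter_forward(hidden, adapter):
    """Apply the adapter to one hidden-state vector."""
    down = linear(hidden, adapter["w_down"], adapter["b_down"])
    activated = [max(0.0, v) for v in down]  # ReLU (assumed activation)
    up = linear(activated, adapter["w_up"], adapter["b_up"])
    # Residual connection: the frozen PLM's representation passes through
    # unchanged, and the adapter only adds a learned correction.
    return [h + u for h, u in zip(hidden, up)]


def adapter_param_count(hidden_size, bottleneck_size):
    """Trainable parameters of one adapter: two weight matrices plus biases."""
    return 2 * hidden_size * bottleneck_size + hidden_size + bottleneck_size


# Assumed sizes for illustration (hidden size 768, bottleneck 48).
print(adapter_param_count(768, 48))       # 74544 per adapter
print(12 * adapter_param_count(768, 48))  # 894528 for one adapter per layer, 12 layers
```

Under these assumed sizes, one adapter per layer across 12 Transformer layers adds fewer than one million trainable parameters, which gives a sense of why training adapters while keeping the PLM frozen can be far cheaper than fine-tuning every weight.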

Paper Structure (39 sections, 5 equations, 8 figures, 6 tables)

Introduction
Literature Review
Background
Transformers and PLMs
Adapters
Study Design
Methodology Overview
Code Representation Using Adapters
Experimental Setup
Dataset
Task
Training L-adapters
Baselines
Evaluation Metric
Results
...and 24 more sections

Figures (8)

Figure 1: Language, task, and invertible adapters in the MAD-X framework, adapted from Pfeiffer et al. (2020).
Figure 2: Layer-wise accuracy of predicting code length in the probing task. The x-axis shows the classification results at each layer, where the first layer is the input embeddings and serves as the naive baseline accuracy. All models have 12 Transformer layers, and the y-axis shows the accuracy at each layer for each model.
Figure 3: Layer-wise accuracy of predicting cyclomatic complexity in the probing task. The x-axis shows the classification results at each layer, where the first layer is the input embeddings and serves as the naive baseline accuracy. All models have 12 Transformer layers, and the y-axis shows the accuracy at each layer for each model.
Figure 4: Accuracies of AST node tagging probing task. The x-axis demonstrates the classification results at each layer (the first layer is the input embeddings, which represent the naive baseline accuracy). All models have 12 Transformer layers. The y-axis shows the accuracy at each layer for each model.
Figure 5: An illustrative example of how adapters affect the last layer of RoBERTa when a Go sample is fed to the model. The left figure shows the attention of the third head on the function name sum without adapters whereas the right figure depicts the same attention head with adapters. As shown, while RoBERTa without adapters only attends to the local token neighbors, RoBERTa equipped with adapters has an in-depth knowledge of the code and pays more attention to the parts that are more related to the function name (e.g., it has strong attention to the func keyword which suggests that it knows that sum is somehow related to that keyword).
...and 3 more figures
