How to Select Pre-Trained Code Models for Reuse? A Learning Perspective

Zhangqian Bi, Yao Wan, Zhaoyang Chu, Yufei Hu, Junyi Zhang, Hongyu Zhang, Guandong Xu, Hai Jin

TL;DR

The paper addresses the challenge of reusing Pre-trained Code Models (PCMs) by comparing naive selection strategies with learning-based model selection under a limited fine-tuning budget. It formulates transferability as a rankable score and proposes two families of strategies, proxy-based and distribution-based, to estimate a PCM's usefulness without exhaustive fine-tuning. Across 100 PCMs and three code tasks, learning-based selection achieves substantial time savings (from roughly 2,700 hours to roughly 100 seconds) with minimal performance degradation (under 6%) compared with brute-force fine-tuning. The work provides practical, scalable guidelines for PCM reuse in AI-assisted software engineering, highlighting that model metadata alone is insufficient for effective selection and that budget-aware ranking can accelerate prototyping and deployment.
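For scale, converting the two reported selection times to a common unit gives the approximate speedup. Both inputs are the rounded figures quoted in the summary, so the ratio is approximate as well:

```latex
\[
2{,}700\,\text{h} \times 3{,}600\,\tfrac{\text{s}}{\text{h}} = 9{,}720{,}000\,\text{s},
\qquad
\frac{9{,}720{,}000\,\text{s}}{100\,\text{s}} = 97{,}200 \approx 10^{5}
\]
```

In other words, learning-based selection is roughly five orders of magnitude faster than brute-force fine-tuning, traded against a performance gap of under 6%.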

Abstract

Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capabilities during pre-training, which enhances their performance on downstream code intelligence tasks. As the number of public pre-trained models grows, selecting the most suitable one to reuse for a specific task becomes essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, by training data, or by brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or incur high costs. Motivated by these findings, we explore learning-based model selection strategies that use pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and we measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared with 2,700 hours for brute-force fine-tuning, with less than 6% performance degradation across related tasks.
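The proxy-based idea in the abstract (score each PCM by how well a cheap model trained on its frozen features performs, then rank) can be sketched as follows. This summary does not say which proxy models the authors train, so the sketch swaps in a nearest-centroid classifier as a hypothetical proxy. It also assumes each PCM's features for the task's examples have already been extracted without updating the PCM's weights (for example, by pooling its hidden states). All names here (`feature_bank`, `proxy_score`, `rank_models`, `pcm_a`, `pcm_b`) are illustrative, not from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def fit_centroids(features: Sequence[Vector], labels: Sequence[int]) -> Dict[int, Vector]:
    """Train the hypothetical proxy: one mean feature vector (centroid) per label."""
    grouped: Dict[int, List[Vector]] = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(list(x))
    return {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in grouped.items()}


def predict(centroids: Dict[int, Vector], x: Vector) -> int:
    """Predict the label whose centroid is nearest in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def proxy_score(features: Sequence[Vector], labels: Sequence[int], train_fraction: float = 0.7) -> float:
    """Transferability score: held-out accuracy of a proxy trained on frozen PCM features."""
    split = int(len(features) * train_fraction)
    if split == 0 or split == len(features):
        raise ValueError("need at least one training and one held-out example")
    centroids = fit_centroids(features[:split], labels[:split])
    held_out = list(zip(features[split:], labels[split:]))
    correct = sum(predict(centroids, x) == y for x, y in held_out)
    return correct / len(held_out)


def rank_models(feature_bank: Dict[str, Sequence[Vector]], labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Score every PCM from its precomputed features, then sort best-first."""
    scores = {name: proxy_score(feats, labels) for name, feats in feature_bank.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy usage: two hypothetical models on a binary task (e.g., vulnerable vs. not).
labels = [0, 1] * 5
feature_bank = {
    "pcm_a": [[2.0 * y, 1.0] for y in labels],  # features that separate the labels
    "pcm_b": [[1.0, 1.0] for _ in labels],      # features carrying no label signal
}
ranking = rank_models(feature_bank, labels)
print(ranking)  # [('pcm_a', 1.0), ('pcm_b', 0.3333333333333333)]
```

The point of the design is cost: fitting a tiny proxy on cached features takes a fraction of a second per model, whereas fine-tuning each candidate end to end is what drives the brute-force baseline into thousands of hours.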

Paper Structure (25 sections, 11 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: The pipeline of developing and using PCMs, from pre-training to fine-tuning
  • Figure 2: The accuracy of each PCM when adapted to the vulnerability detection task via fine-tuning. Accuracy is encoded as color shading, with deeper shades indicating higher values. Models are sorted by size in ascending order
  • Figure 3: The accuracy of (a) CodeBERT models and (b) PLBART models on the vulnerability detection task. The models are pre-trained on datasets of different sizes and programming languages
  • Figure 4: The time cost of brute-force fine-tuning and various learning strategies for model selection. The horizontal axis (Model Hub Size) represents the number of models involved in the selection process (measured per model), and the vertical axis (Time) indicates the selection time cost in seconds
  • Figure 5: An illustration of the (a) proxy-based and (b) distribution-based model selection strategies
  • ...and 3 more figures
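Figure 5(b) refers to the distribution-based strategy, which, per the abstract, scores a PCM by how close its latent features are to the task's labels. The summary does not name the deviation measure the authors use, so this sketch substitutes a simple, well-known stand-in: the trace-ratio Fisher criterion, i.e., between-class scatter divided by within-class scatter of the features. A higher score means the features already cluster by label, which is taken as a sign of better transferability. The function and model names (`separability_score`, `rank_by_separability`, `pcm_tight`, `pcm_loose`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def separability_score(features: Sequence[Vector], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Stand-in distribution score: between-class scatter / within-class scatter."""
    overall = mean_vector(features)
    grouped: Dict[int, List[Vector]] = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(list(x))
    between = 0.0
    within = 0.0
    for members in grouped.values():
        centroid = mean_vector(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(x, centroid) for x in members)
    return between / (within + eps)  # eps guards against perfectly tight clusters


def rank_by_separability(feature_bank: Dict[str, Sequence[Vector]], labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Rank PCMs best-first by how well their frozen features separate the labels."""
    scores = {name: separability_score(f, labels) for name, f in feature_bank.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy usage with one-dimensional features from two hypothetical models.
labels = [0, 0, 1, 1]
feature_bank = {
    "pcm_tight": [[0.0], [0.2], [1.0], [1.2]],  # labels form two tight clusters
    "pcm_loose": [[0.0], [1.0], [0.2], [1.2]],  # labels interleaved along the axis
}
for name, score in rank_by_separability(feature_bank, labels):
    print(f"{name}: {score:.2f}")
# pcm_tight: 25.00
# pcm_loose: 0.04
```

Unlike the proxy-based sketch, this score needs no training or held-out split at all: it is a closed-form statistic over the cached features, which is one reason distribution-based methods can be even cheaper per model.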