Why Personalizing Deep Learning-Based Code Completion Tools Matters

Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota

TL;DR

The paper addresses whether fine-tuning DL-based code completion models on organization- or developer-specific data improves performance. It analyzes two organizations (Apache, Spring), two model families (T5 and Code Llama), and multiple sizes, using time-based splits and rigorous statistics to compare developer- and organization-specific personalization against generic baselines. Key findings show organization-specific fine-tuning yields the strongest improvements and can match or exceed the performance of significantly larger generic models, with favorable cost-effectiveness as smaller specialized models incur lower inference costs. The results generalize across architectures and suggest practical benefits for in-house code completion tools, including significant opportunities for cost savings and deployment simplicity. The work also provides a framework for assessing data specificity, model size, and cost trade-offs in code-recommender personalization. Overall, organization-level personalization emerges as the most reliable and scalable approach for real-world deployment, with developer-level gains limited by data availability but potentially enhanced via data augmentation. The paper contributes actionable guidance for building in-house, cost-efficient, personalized code completion systems and outlines future directions in online adaptation and broader language support.
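To illustrate the time-based splits mentioned above, here is a minimal Python sketch. It assumes each code change carries a timestamp, and that the oldest changes are used for training and the most recent ones for testing, so the model is never evaluated on code written before its training data. The `Change` record, the function name, and the 80/10/10 ratios are hypothetical choices for this example and are not taken from the paper.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Change(NamedTuple):
    """Hypothetical record for one code change by a developer."""
    author: str
    timestamp: datetime
    code: str


def time_based_split(
    changes: List[Change],
    train_frac: float = 0.8,  # assumed ratio, not from the paper
    val_frac: float = 0.1,    # assumed ratio, not from the paper
) -> Tuple[List[Change], List[Change], List[Change]]:
    """Split changes chronologically into train, validation, and test sets."""
    # Sort oldest first so later slices only contain more recent changes.
    ordered = sorted(changes, key=lambda c: c.timestamp)
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Example usage with synthetic data: ten changes on ten different days.
start = datetime(2020, 1, 1)
demo = [Change("dev", start + timedelta(days=d), f"change-{d}") for d in (3, 0, 7, 1, 9, 4, 2, 8, 5, 6)]
demo_train, demo_val, demo_test = time_based_split(demo)
```

A random split would mix past and future code in the training set, which can overstate how well a personalized model predicts a developer's next changes.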

Abstract

Deep learning (DL)-based code completion tools have transformed software development by enabling advanced code generation. These tools leverage models trained on vast amounts of code from numerous repositories, capturing general coding patterns. However, the impact of fine-tuning these models for specific organizations or developers to boost their performance on such subjects remains unexplored. In this work, we fill this gap by presenting solid empirical evidence answering this question. More specifically, we consider 136 developers from two organizations (Apache and Spring), two model architectures (T5 and Code Llama), and three model sizes (60M, 750M, and 7B trainable parameters). T5 models (60M, 750M) were pre-trained and fine-tuned on over 2,000 open-source projects, excluding the subject organizations' data, and compared against versions fine-tuned on organization- and developer-specific datasets. For the Code Llama model (7B), we compared the performance of the pre-trained model publicly available online with that of the same model fine-tuned via parameter-efficient fine-tuning on organization- and developer-specific datasets. Our results show that both organization-specific and developer-specific additional fine-tuning boost prediction capabilities, with the former being particularly performant. This finding generalizes across (i) the two subject organizations (i.e., Apache and Spring) and (ii) models of completely different magnitude (from 60M to 7B trainable parameters). Finally, we show that DL models fine-tuned on an organization-specific dataset achieve the same completion performance as pre-trained code models that are used out of the box and are $\sim$10$\times$ larger, with consequent savings in deployment and inference cost (e.g., smaller GPUs needed).
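The cost argument can be made concrete with a simple break-even calculation. The sketch below assumes a linear cost model in which the smaller organization-specific model pays a one-time fine-tuning cost and then saves a fixed amount on every completion request compared with the larger generic model. The function name, parameters, and figures are all hypothetical, and the paper's own cost-effectiveness analysis may be set up differently.

```python
import math
from typing import Optional


def break_even_requests(
    finetune_cost: float,
    large_cost_per_request: float,
    small_cost_per_request: float,
) -> Optional[int]:
    """Number of requests after which the one-time fine-tuning cost is recovered.

    Returns None when the smaller model is not cheaper per request,
    since the fine-tuning cost would then never be recovered.
    """
    saving = large_cost_per_request - small_cost_per_request
    if saving <= 0:
        return None
    return math.ceil(finetune_cost / saving)


# Example with made-up cost units: fine-tuning costs 300 units, the large
# model costs 4 units per request, and the small model costs 1 unit.
requests_needed = break_even_requests(300.0, 4.0, 1.0)
```

Under this assumed model, once the number of served completions exceeds the break-even point, every further request handled by the smaller specialized model is a net saving.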

Paper Structure

This paper contains 26 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Mining process to create the developer-specific datasets.
  • Figure 2: Developer- and organization-specific datasets.
  • Figure 3: Distributions of metrics correlating training and test sets across all 100 developers of Apache. Base = Baseline $B_s$, Dev = Developer, Org = Organization. Higher is better.
  • Figure 4: Cost-effectiveness analysis: Generic T5$_{large}$ vs. organization-specific T5$_{small}$.