Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes

Ulrich Finkler; Irene Manotas; Wei Zhang; Geert Janssen; Octavian Popescu; Shyam Ramji

TL;DR

The paper addresses the challenge of adapting LLMs to private enterprise code repositories to improve code completion on code that was not seen during training. It introduces an automated data ingestion pipeline that builds training data from semantic scopes within the code and compares two customization strategies: Retrieval-Augmented Generation (RAG) and supervised Fine-Tuning (FT). Across two large, private Java/C++ repositories (DataB and STM) and public benchmarks, FT on semantic-scope data consistently outperforms off-the-shelf models and RAG, with smaller models delivering faster, more concise predictions suitable for on-prem use. The work demonstrates practical, scalable repository-level customization that reduces reliance on extensive human labeling, and it highlights the value of semantic-scope data for improving predictiveness and developer productivity.

Abstract

Code completion (CC) is a task that developers frequently perform when working with LLM-based programming assistants. Despite the improved performance of LLMs on public benchmarks, out-of-the-box LLMs still struggle to generate code that aligns with a private code repository absent from the model's training data. Customizing code LLMs to a private repository provides a way to improve model performance. In this paper we present our approach for automated LLM customization based on semantic scopes in the code. We evaluate LLMs on real industry cases, using two private enterprise code repositories and two customization strategies: Retrieval-Augmented Generation (RAG) and supervised Fine-Tuning (FT). Our mechanism for ingesting the repository's data and formulating training data pairs from semantic scopes helps models learn the underlying patterns specific to the repository, providing more precise code to developers and helping to boost their productivity. The code completions of moderately sized customized models can be significantly better than those of uncustomized models of much larger capacity. We also include an analysis of customization on two public benchmarks and present opportunities for future work.
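To make the idea of training pairs built from semantic scopes more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes, purely for illustration, that a semantic scope can be approximated by a balanced brace-delimited block in Java or C++ source, and that each pair consists of the code up to the opening brace (the context) and the block body (the completion target). The function name `scope_pairs` and this pairing rule are hypothetical.

```python
from typing import List, Tuple


def scope_pairs(source: str) -> List[Tuple[str, str]]:
    """Return (context, completion) pairs, one per brace-delimited block.

    Assumption: a "semantic scope" is approximated by a balanced {...} block.
    Braces inside string literals or comments are NOT handled; a real
    pipeline would rely on a proper parser instead of character scanning.
    """
    pairs: List[Tuple[str, str]] = []
    open_braces: List[int] = []  # positions of '{' that are still open
    for i, ch in enumerate(source):
        if ch == "{":
            open_braces.append(i)
        elif ch == "}" and open_braces:
            start = open_braces.pop()
            context = source[: start + 1]           # code up to and including '{'
            completion = source[start + 1 : i + 1]  # block body plus closing '}'
            pairs.append((context, completion))
    return pairs


if __name__ == "__main__":
    snippet = "int f() { if (x) { y(); } return 0; }"
    for context, completion in scope_pairs(snippet):
        print(repr(context), "->", repr(completion))
```

Because inner blocks close first, nested scopes yield pairs from the innermost block outward, so a single function can contribute several training examples of different lengths.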

Paper Structure (21 sections, 3 figures, 5 tables)

Introduction
Background
Repository-level Context for Coding Tasks
LLM Code Completion and Post Training Strategies
Benchmarks
Automated Data Preparation and Model Customization
Automated Data Preparation
LLM Customization
RAG
Fine Tuning
Methodology
Models
Datasets
Evaluation Metrics
Model Customization
...and 6 more sections

Figures (3)

Figure 1: Customization Pipeline Based on Semantic Scopes for Repository-level Code Completion.
Figure 2: Semantic Scope based Repository Data Ingestion. Left side shows the process flow; Right side shows two examples of semantic scopes (shaded boxes).
Figure 3: Performance (optimal and full Levenshtein Distance) of Baseline Models shown in logarithmic scale. (Left) Results on DataB repository's test set. (Right) Results on STM repository's test set. Shorter bar is better.
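
Figure 3 reports performance as a Levenshtein distance between predicted and reference code. The "optimal" and "full" variants are not defined in this excerpt, so the sketch below shows only the standard character-level edit distance they presumably build on; the function name and the example strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    predicted = "return cache.get(key);"
    reference = "return cache.getOrDefault(key, null);"
    print(levenshtein(predicted, reference))
```

A lower distance means the predicted completion needs fewer edits to match the reference, which is consistent with the caption's note that a shorter bar is better.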
