Studying LLM Performance on Closed- and Open-source Data

Toufique Ahmed; Christian Bird; Premkumar Devanbu; Saikat Chakraborty

TL;DR

This study evaluates off-the-shelf LLMs, trained largely on open-source code, on proprietary closed-source Microsoft data in two languages, C# and C++. It uses four software engineering tasks (token completion, line completion, code summarization, and code generation) with zero-shot and few-shot prompts, augmented by BM25 retrieval and cross-source experiments. Performance degrades minimally for C# when moving from OSS to closed-source code but drops substantially for C++, driven in part by identifier patterns and embedding differences. Incorporating open-source examples into few-shot prompts can improve performance on closed-source data, especially for C#, suggesting practical ways to apply OSS-trained models to industry codebases.

Abstract

Large language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry and are largely trained on open-source (OSS) code distributed under permissive licenses. In actual use, however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there workarounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS to proprietary code but drops significantly for C++; we find that this difference is attributable to differences in identifiers. We also find that some of the performance degradation can, in some cases, be ameliorated efficiently by in-context learning.

Paper Structure (41 sections, 5 figures, 8 tables)

This paper contains 41 sections, 5 figures, 8 tables.

Introduction
Background & Motivation
Open-source vs. Closed-source
Why Study Multiple Languages?
Research Questions
Tasks, Dataset, and Models
Tasks
Token Completion
Line Completion
Code Summarization
Code Generation
Datasets
Open-source
Closed-source
Model(s) Used for Performance Evaluation
...and 26 more sections

Figures (5)

Figure 1: Steps of our pipeline for the code summarization task: (1) the target code is sent to the pool of samples; (2) n samples, chosen at random or by BM25, are retrieved to build the prompt; (3) the prompt is built by combining the retrieved code-comment pairs with the input code; (4) the prompt is sent to the GPT-3.x model; (5) the target comment is extracted from the model output.
Figure 2: Boxplot presenting the lengths of OSS and closed-source functions.
Figure 3: Sub-token count distribution of identifiers used in OSS and closed-source code.
Figure 4: Probability of drawing samples from closed-source data when both OSS and closed-source functions are used as sample pools.
Figure 5: t-SNE plots showing the embedding distance between OSS and closed-source functions in two programming languages.
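
The retrieval and prompt-building steps in the Figure 1 caption (steps 2 and 3) can be illustrated with a minimal sketch. This is not the authors' implementation. The tokenizer, the BM25 parameters `k1` and `b`, the "Code:/Comment:" prompt template, and the choice to place the most similar example last are all assumptions made for illustration, and the names `tokenize`, `bm25_scores`, and `build_prompt` are hypothetical. The call to the model (step 4) is left out.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Assumed tokenizer: lowercase identifier-like runs and integers only.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", code.lower())


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    # Okapi BM25 score of every pool document against the query tokens.
    n = len(docs_tokens)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))  # document frequency of each term
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(target_code, pool, n_shots=2):
    # pool: list of (code, comment) pairs; returns a few-shot prompt string.
    docs = [tokenize(code) for code, _ in pool]
    scores = bm25_scores(tokenize(target_code), docs)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    parts = []
    # Assumption: the most similar example goes last, next to the query.
    for i in reversed(ranked[:n_shots]):
        code, comment = pool[i]
        parts.append(f"Code:\n{code}\nComment: {comment}\n")
    parts.append(f"Code:\n{target_code}\nComment:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo_pool = [
        ("int Add(int a, int b) { return a + b; }", "Adds two integers."),
        ("int Multiply(int a, int b) { return a * b; }", "Multiplies two integers."),
    ]
    print(build_prompt("int Subtract(int a, int b) { return a - b; }", demo_pool))
```

For the cross-source setting mentioned in the TL;DR, the same function could be given a pool of OSS code-comment pairs while the target comes from closed-source code; how the paper actually mixes the pools is not specified in this section.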
