LiCoEval: Evaluating LLMs on License Compliance in Code Generation

Weiwei Xu; Kai Gao; Hao He; Minghui Zhou

TL;DR

LiCoEval addresses license compliance in AI-assisted coding by introducing a benchmark and empirically deriving a practical standard for striking similarity between generated and open-source code. The authors construct LiCoEval from license-bearing open-source software using World of Code data, then evaluate 14 popular LLMs. Even top-performing models produce a non-negligible proportion (0.88% to 2.01%) of strikingly similar outputs, and most fail to provide accurate license information, especially for code under copyleft licenses. The study highlights the need for improved data curation, license-aware training, and better attribution mechanisms to mitigate legal risks in AI-assisted software development. LiCoEval thus provides a foundation for future work on license compliance, guiding model developers, users, and policymakers toward safer use of LLMs in code generation.

Abstract

Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical yet underexplored issue of license compliance in LLM-generated code by establishing a benchmark that evaluates the ability of LLMs to provide accurate license information for the code they generate. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity", that is, a degree of similarity that excludes the possibility of independent creation and therefore indicates a copy relationship between the LLM output and specific open-source code. Based on this standard, we propose LiCoEval to evaluate the license compliance capabilities of LLMs, i.e., their ability to provide accurate license or copyright information when they generate code that is strikingly similar to existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs and find that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

Paper Structure (32 sections, 1 equation, 8 figures, 4 tables)

This paper contains 32 sections, 1 equation, 8 figures, 4 tables.

Introduction
Background and Related Work
License Compliance and IP Infringement
Memorization in LLMs for Code
Evaluations of LLMs for Code Generation
Empirical Study on Standard of Striking Similarity
Research Question
Selection of LLMs
Experiment Setup
Method
Construction of Code Samples
Features to characterize String Similarity
Results
Validation
Evaluation Framework and Benchmark for LLM License Compliance
...and 17 more sections

Figures (8)

Figure 1: Overview of this study.
Figure 2: Structure of function-level code snippet.
Figure 3: Overview of code samples construction.
Figure 4: The similarity between output of WizardCoder and the corresponding open-source code in two groups.
Figure 5: The distribution of similarity between the generated code and the corresponding open-source implementations in two groups, in relation to the number of function body lines, cyclomatic complexity, and the number of same comments. The similarity value is the maximum of the three text similarity metrics.
...and 3 more figures
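
The Figure 5 caption states that the reported similarity value is the maximum of three text similarity metrics. The sketch below illustrates only that "take the maximum" step. The caption does not name the three metrics, so the ones used here (token-set Jaccard overlap and two `difflib` matching ratios) are stand-ins chosen for illustration, and all function names are hypothetical. It is not the authors' implementation, and it deliberately sets no threshold for "striking similarity", since the page does not state one.

```python
from difflib import SequenceMatcher


def tokens(code: str) -> list[str]:
    """Split code on whitespace; a deliberately crude tokenizer (assumption)."""
    return code.split()


def jaccard(a: str, b: str) -> float:
    """Stand-in metric 1: Jaccard overlap of the two token sets."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0  # two empty snippets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def char_ratio(a: str, b: str) -> float:
    """Stand-in metric 2: character-level matching ratio from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def token_ratio(a: str, b: str) -> float:
    """Stand-in metric 3: token-sequence matching ratio from difflib."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def max_similarity(generated: str, reference: str) -> float:
    """Return the maximum of the three stand-in metrics, each in [0, 1]."""
    metrics = (jaccard, char_ratio, token_ratio)
    return max(metric(generated, reference) for metric in metrics)
```

Taking the maximum means a pair counts as similar if any single metric considers it similar, so the combined score is never lower than any individual metric.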
