LiCoEval: Evaluating LLMs on License Compliance in Code Generation
Weiwei Xu, Kai Gao, Hao He, Minghui Zhou
TL;DR
LiCoEval addresses license compliance in AI-assisted coding by introducing a benchmark built on an empirically derived, practical standard for "striking similarity" between generated code and open-source code. The authors construct LiCoEval from license-bearing open-source software (OSS) using World of Code data, then evaluate 14 popular LLMs. The evaluation reveals a non-negligible rate of strikingly similar outputs and frequently inaccurate license information, especially for code under copyleft licenses. The study highlights the need for better data curation, license-aware training, and attribution mechanisms to mitigate legal risk in AI-assisted software development. LiCoEval thus provides a foundation for future work on license compliance and can guide model developers, users, and policymakers toward safer use of LLMs in code generation.
Abstract
Recent advances in Large Language Models (LLMs) have revolutionized code generation and driven widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, which can lead to intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark that evaluates the ability of LLMs to provide accurate license information for the code they generate. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity", that is, a degree of similarity that excludes the possibility of independent creation and therefore indicates a copy relationship between the LLM output and specific open-source code. Based on this standard, we propose LiCoEval to evaluate the license compliance capabilities of LLMs, i.e., their ability to provide accurate license or copyright information when they generate code that is strikingly similar to existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs and find that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing both to the protection of open-source software copyrights and to the mitigation of legal risks for LLM users.
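The two-stage evaluation described above (first flag outputs that are strikingly similar to a reference, then check whether the model reported the correct license for those outputs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Sample` record, the `evaluate` function, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the 0.9 threshold are hypothetical and are not the standard or implementation that the paper derives.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical cutoff for illustration only; the paper derives its own
# "striking similarity" standard empirically.
SIMILARITY_THRESHOLD = 0.9


@dataclass
class Sample:
    """One benchmark item (hypothetical structure)."""
    reference_code: str              # open-source implementation
    reference_license: str           # e.g. an SPDX identifier such as "MIT"
    generated_code: str              # LLM output for the same prompt
    reported_license: Optional[str]  # license named by the LLM, if any


def is_strikingly_similar(generated: str, reference: str,
                          threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Stand-in similarity test based on a character-level match ratio."""
    return SequenceMatcher(None, generated, reference).ratio() >= threshold


def evaluate(samples: list) -> dict:
    """Return the rate of similar outputs and license accuracy among them."""
    similar = [s for s in samples
               if is_strikingly_similar(s.generated_code, s.reference_code)]
    correct = [s for s in similar
               if s.reported_license == s.reference_license]
    return {
        "similar_rate": len(similar) / len(samples) if samples else 0.0,
        # Undefined (None) when no output is strikingly similar.
        "license_accuracy": len(correct) / len(similar) if similar else None,
    }
```

Scoring license accuracy only over the strikingly similar outputs mirrors the definition in the abstract, where compliance matters precisely when a copy relationship is indicated.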
