An Investigation into Misuse of Java Security APIs by Large Language Models

Zahra Mousavi, Chadni Islam, Kristen Moore, Alsharif Abuadbba, Muhammad Ali Babar

TL;DR

This paper investigates how a leading LLM, ChatGPT, performs when generating Java code that uses security APIs. By constructing 48 tasks across five security APIs and employing a hybrid misuse-detection approach, the authors quantify misuse rates and categorize 20 distinct misuse types. The findings reveal a high overall misuse rate (~70%), with certain APIs (e.g., OAuth, Biometrics) showing near-100% misuse, and others (e.g., PRNG) performing relatively well. The work underscores the need for improved API usability, up-to-date model knowledge, and dedicated tooling to audit and repair AI-generated security code, offering a dataset and framework to guide future research and practice.

Abstract

The increasing trend of using Large Language Models (LLMs) for code generation raises the question of their capability to generate trustworthy code. While many researchers are exploring the utility of code generation for uncovering software vulnerabilities, one crucial but often overlooked aspect is the security Application Programming Interfaces (APIs). APIs play an integral role in upholding software security, yet effectively integrating security APIs presents substantial challenges. This leads to inadvertent misuse by developers, thereby exposing software to vulnerabilities. To overcome these challenges, developers may seek assistance from LLMs. In this paper, we systematically assess ChatGPT's trustworthiness in code generation for security API use cases in Java. To conduct a thorough evaluation, we compile an extensive collection of 48 programming tasks for 5 widely used security APIs. We employ both automated and manual approaches to effectively detect security API misuse in the code generated by ChatGPT for these tasks. Our findings are concerning: around 70% of the code instances across 30 attempts per task contain security API misuse, with 20 distinct misuse types identified. Moreover, for roughly half of the tasks, this rate reaches 100%, indicating that there is a long way to go before developers can rely on ChatGPT to securely implement security API code.
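
To make the notion of "security API misuse" concrete, the following is a minimal illustrative sketch for one of the API families named above (PRNG). It is not taken from the paper's dataset or its list of 20 misuse types; the class name `TokenExample` and both method names are hypothetical. It contrasts a commonly cited insecure pattern (a fixed-seed `java.util.Random`) with a safer one (a default-constructed `java.security.SecureRandom`).

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Random;

// Hypothetical example class; names are illustrative, not from the paper.
class TokenExample {

    // Misuse (illustrative): java.util.Random is not cryptographically secure,
    // and the fixed seed makes the generated token identical on every call.
    static String insecureToken() {
        Random random = new Random(42L);
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Safer pattern: SecureRandom with its default, self-seeded construction.
    static String secureToken() {
        SecureRandom random = new SecureRandom();
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // The fixed seed reproduces the same token, so this prints true.
        System.out.println(insecureToken().equals(insecureToken()));
        // Two independent 128-bit tokens; a collision is astronomically unlikely.
        System.out.println(secureToken().equals(secureToken()));
    }
}
```

Both methods compile and run without error, which is part of why this kind of misuse can be hard to spot by testing alone: the difference lies in predictability of the output, not in functional behavior.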

Paper Structure (37 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: A misuse of SSL/TLS API leading to the leakage of user personal information
  • Figure 2: An overview of the evaluation framework to study the application of LLMs for programming with security APIs
  • Figure 3: An overview of task design for security APIs
  • Figure 4: Analysis Results of GPT-4 Responses for Security Functionality Programming Tasks
  • Figure 5: Misuse Rates for Security Functionality Programming Tasks