Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs

Yifan Xia, Zichen Xie, Peiyu Liu, Kangjie Lu, Yan Liu, Wenhai Wang, Shouling Ji

TL;DR

The paper studies cryptographic API misuse detection and the limitations of pattern-based static analysis tools (SATs), and evaluates LLM-based approaches that use contextual understanding for more robust detection. It introduces an evaluation framework built on refined manually-crafted and real-world benchmarks, comparing five state-of-the-art LLMs under unconstrained and task-aware settings, augmented by a code and analysis validation workflow. The study finds that LLMs produce substantial false positives because of their stochasticity, but reach a detection rate of around 90% when guided by a domain-specific scope and validation, surpassing traditional SATs on several benchmarks and uncovering previously unknown misuses. A real-world usability study identifies 63 misuses across 28 projects, most of which developers acknowledged, and the paper closes with recurring failure patterns and recommendations for future LLM-based security tooling.

Abstract

While the automated detection of cryptographic API misuses has progressed significantly, its precision diminishes for intricate targets due to the reliance on manually defined patterns. Large Language Models (LLMs), renowned for their contextual understanding, offer a promising avenue to address existing shortcomings. However, applying LLMs in this security-critical domain presents challenges, particularly due to the unreliability stemming from LLMs' stochastic nature and the well-known issue of hallucination. To explore the prevalence of LLMs' unreliable analysis and potential solutions, this paper introduces a systematic evaluation framework to assess LLMs in detecting cryptographic misuses, utilizing a comprehensive dataset encompassing both manually-crafted samples and real-world projects. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. Nevertheless, we demonstrate how a constrained problem scope, coupled with LLMs' self-correction capability, significantly enhances the reliability of the detection. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks. Moreover, we identify the failure patterns that persistently hinder LLMs' reliability, including both cryptographic knowledge deficiency and code semantics misinterpretation. Guided by these insights, we develop an LLM-based workflow to examine open-source repositories, leading to the discovery of 63 real-world cryptographic misuses. Of these, 46 have been acknowledged by the development community, with 23 currently being addressed and 6 resolved. Reflecting on developers' feedback, we offer recommendations for future research and the development of LLM-based security tools.
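To make the abstract's point about manually defined patterns concrete, the following is a minimal sketch and not the paper's tooling. It assumes a hypothetical single-rule detector (`WEAK_CIPHER_RULE`, `pattern_detect`) that flags Java `Cipher.getInstance` calls whose string literal names ECB mode or DES. The Java snippets are illustrative strings invented for this example. The rule catches the direct misuse but misses the same misuse once the transformation string is assembled through variables, which is the kind of intricate target where contextual understanding could help.

```python
import re

# Hypothetical rule for illustration only: flag Cipher.getInstance calls
# whose *literal* argument mentions ECB mode or DES.
WEAK_CIPHER_RULE = re.compile(
    r'Cipher\.getInstance\(\s*"[^"]*(?:ECB|DES)[^"]*"'
)


def pattern_detect(java_source: str) -> bool:
    """Return True if the hand-written pattern matches the source text."""
    return WEAK_CIPHER_RULE.search(java_source) is not None


# Direct misuse: the insecure mode appears as a literal in the call.
DIRECT = 'Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");'

# Same misuse, but the transformation is built from variables, so the
# literal-based rule has nothing to match at the call site.
INDIRECT = '''
String mode = "EC" + "B";
String transformation = "AES/" + mode + "/PKCS5Padding";
Cipher c = Cipher.getInstance(transformation);
'''

# Not flagged by this rule: an authenticated mode.
SAFE = 'Cipher c = Cipher.getInstance("AES/GCM/NoPadding");'

if __name__ == "__main__":
    for name, snippet in [("direct", DIRECT), ("indirect", INDIRECT), ("safe", SAFE)]:
        print(name, pattern_detect(snippet))
```

Real SATs rely on data-flow analysis rather than a single regular expression, so this sketch overstates their weakness; it only shows why coverage depends on the patterns someone thought to write down.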

Paper Structure

This paper contains 46 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: The Evaluation Framework for LLM-based Cryptographic Misuse Detection.
  • Figure 2: Number of Test Cases where LLMs Report Unexpected Alerts.
  • Figure 3: LLMs' Detection Accuracy across Test Cases with Different Complexity.
  • Figure 4: Detection Accuracy Comparison Across Benchmarks.
  • Figure 5: Prompt for Misuse Detection.
  • ...and 3 more figures