ZeroFalse: Improving Precision in Static Analysis with LLMs

Mohsen Iranmanesh, Sina Moradi Sabet, Sina Marefat, Ali Javidi Ghasr, Allison Wilson, Iman Sharafaldin, Mohammad A. Tayebi

TL;DR

Static analysis tools produce many false positives, which erodes developer trust. ZeroFalse couples static analysis with LLM adjudication by enriching SARIF alerts with flow-sensitive dataflow traces and CWE-specific knowledge, using deterministic, schema-constrained prompts. Empirical results across ten LLMs and two datasets show that CWE-aware prompting and reasoning-oriented models achieve strong F1-scores while maintaining high recall, enabling practical CI/CD integration. The work demonstrates that structured context and domain-specific reasoning are critical for robust false-positive mitigation in real-world codebases. This approach paves the way for more reliable, scalable SAST-assisted security in large-scale software development pipelines.

Abstract

Static Application Security Testing (SAST) tools are integral to modern software development, yet their adoption is undermined by excessive false positives that weaken developer trust and demand costly manual triage. We present ZeroFalse, a framework that integrates static analysis with large language models (LLMs) to reduce false positives while preserving coverage. ZeroFalse treats static analyzer outputs as structured contracts, enriching them with flow-sensitive traces, contextual evidence, and CWE-specific knowledge before adjudication by an LLM. This design preserves the systematic reach of static analysis while leveraging the reasoning capabilities of LLMs. We evaluate ZeroFalse across both benchmarks and real-world projects using ten state-of-the-art LLMs. Our best-performing models achieve F1-scores of 0.912 on the OWASP Java Benchmark and 0.955 on the OpenVuln dataset, maintaining recall and precision above 90%. Results further show that CWE-specialized prompting consistently outperforms generic prompts, and reasoning-oriented LLMs provide the most reliable precision-recall balance. These findings position ZeroFalse as a practical and scalable approach for enhancing the reliability of SAST and supporting its integration into real-world CI/CD pipelines.
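
To make the reported metrics concrete, the following is a minimal sketch of how precision, recall, and F1 relate to one another. It assumes that a true vulnerability is the positive class and uses made-up confusion counts; the function name and the numbers are illustrative and do not come from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts.

    Assumption: tp, fp, fn count alerts where "vulnerable" is the positive class.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall, so an F1 above 0.9 is consistent with both being above 90%.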

Paper Structure

This paper contains 22 sections, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The ZeroFalse pipeline. CodeQL generates alerts, related code context is collected, the dataflow is extracted and annotated, CWE-specific prompts are constructed, and finally, an LLM classifies alerts to identify false positives.
  • Figure 2: Structure of the adjudication prompt template, divided into five ordered segments.
  • Figure 3: Comparison of models sorted by Precision, Recall, and F1-score. (a) OWASP. (b) OpenVuln.
  • Figure 4: Heatmap of F1-scores across CWE categories (rows) and models (columns) on OWASP. Lighter shading indicates higher performance.
  • Figure 5: The figure compares model performance on OWASP (left) and OpenVuln (right), with bubbles positioned by latency (X-axis) and cost (Y-axis), and sized by F1-score to illustrate efficiency and generalization.
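
The pipeline in Figure 1 can be illustrated with a rough sketch: flatten the dataflow trace of a SARIF result, combine it with CWE-specific guidance into a prompt, and validate the model's reply against a fixed schema. This is not the authors' implementation. The guidance strings, the prompt layout, the verdict labels, and all function names are assumptions made for illustration; only the SARIF field names (`codeFlows`, `threadFlows`, `physicalLocation`, and so on) follow the SARIF 2.1.0 format.

```python
import json

# Hypothetical CWE guidance; the paper's actual prompt text is not reproduced here.
CWE_HINTS = {
    "CWE-089": "SQL injection: check whether tainted input reaches a query unparameterized.",
    "CWE-079": "Cross-site scripting: check whether output is encoded before rendering.",
}


def extract_trace(result):
    """Flatten the first SARIF codeFlow of a result into (uri, line, message) steps."""
    steps = []
    for flow in result.get("codeFlows", [])[:1]:
        for thread in flow.get("threadFlows", []):
            for entry in thread.get("locations", []):
                loc = entry.get("location", {})
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri", "?")
                line = phys.get("region", {}).get("startLine", 0)
                msg = loc.get("message", {}).get("text", "")
                steps.append((uri, line, msg))
    return steps


def build_prompt(result, cwe_id):
    """Assemble an adjudication prompt (illustrative layout, not the paper's template)."""
    rule = result.get("ruleId", "unknown")
    text = result.get("message", {}).get("text", "")
    hint = CWE_HINTS.get(cwe_id, "No CWE-specific guidance available.")
    trace = "\n".join(
        f"{i + 1}. {uri}:{line} {msg}"
        for i, (uri, line, msg) in enumerate(extract_trace(result))
    )
    return (
        f"Alert: {rule}: {text}\n"
        f"CWE guidance ({cwe_id}): {hint}\n"
        f"Dataflow trace:\n{trace}\n"
        'Answer with JSON only: {"verdict": "true_positive" or "false_positive"}'
    )


def parse_verdict(raw):
    """Accept only replies that match the expected schema; reject anything else."""
    data = json.loads(raw)
    if data.get("verdict") not in ("true_positive", "false_positive"):
        raise ValueError("verdict missing or invalid")
    return data["verdict"]
```

Rejecting any reply that does not match the schema keeps the adjudication step predictable, which matters when the verdict gates a CI/CD job.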