Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE

Benjamin Steenhoek, Kalpathy Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le

TL;DR

This study examines the gap between AI vulnerability detection and fix tools that perform well on benchmarks and their practicality in real-world software development. It introduces DeepVulGuard, an IDE-integrated system using CodeBERT for detection and GPT-4 for explanations and fixes, evaluated through a user study with 17 Microsoft developers across 24 projects. The findings show promising but imperfect performance: in its current form the tool suffers from a high rate of false positives, fixes that are not customized to the user's codebase, and workflow disruption, but it shows clear value in explanations, chat interaction, and confidence-based prioritization. The work offers actionable recommendations for evaluating and deploying AI-driven vulnerability tools in practice and contributes data and code to support future user studies. The results underscore the importance of contextual awareness, workflow integration, and consistent AI outputs for real-world adoption.

Abstract

This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historical vulnerability data. DeepVulGuard scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, and provides natural-language explanations for alerts and fixes, leveraging chat interfaces. We recruited 17 professional software developers at Microsoft, observed their usage of the tool on their code, and conducted interviews to assess the tool's usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users' perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user's codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at https://doi.org/10.6084/m9.figshare.26367139.

Paper Structure

This paper contains 16 sections and 13 figures.

Figures (13)

  • Figure 1: Overview of DeepVulGuard's user interface on an example program. (1) An editor alert; (2) Problems menu entry; (3) The explanation of the alert; (4a) Quick fix interaction; (4b) Ignore options; (4c) Fix trigger; (5) Suggested fix; (6) Explanation of the fix suggestion; (7) Accept/Reject buttons.
  • Figure 2: An overview of DeepVulGuard's detection workflow. (1) Binary classification into vulnerable/not-vulnerable; (2) Localization; (3) Multi-class classification into one of 27 vulnerability types; (4) Alert and explanation shown to the user.
  • Figure 3: DeepVulGuard's LLM filter prompt.
  • Figure 4: DeepVulGuard's fix model prompt.
  • Figure 5: Performance of DeepVulGuard's detection component on vulnerabilities from SVEN.
  • ...and 8 more figures
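
The four-stage detection workflow described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative outline, not DeepVulGuard's implementation: the `Alert` type, the `detect` function, and all `toy_*` helpers are hypothetical names introduced here, and simple string checks stand in for the learned models and the LLM-generated explanation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Alert:
    """Hypothetical alert record shown to the user (stage 4)."""
    vuln_type: str
    start_line: int
    end_line: int
    explanation: str


def detect(
    code: str,
    is_vulnerable: Callable[[str], bool],
    localize: Callable[[str], Tuple[int, int]],
    classify_type: Callable[[str], str],
    explain: Callable[[str, str], str],
) -> Optional[Alert]:
    """Run the staged workflow; each stage is passed in as a callable."""
    # (1) Binary classification: vulnerable or not vulnerable.
    if not is_vulnerable(code):
        return None
    # (2) Localization: which lines form the vulnerable region.
    start_line, end_line = localize(code)
    # (3) Multi-class classification into a vulnerability type.
    vuln_type = classify_type(code)
    # (4) Build the alert and its explanation for the user.
    return Alert(vuln_type, start_line, end_line, explain(code, vuln_type))


# Toy stand-ins for the models (assumptions for illustration only).
def toy_is_vulnerable(code: str) -> bool:
    return "strcpy(" in code


def toy_localize(code: str) -> Tuple[int, int]:
    for number, line in enumerate(code.splitlines(), start=1):
        if "strcpy(" in line:
            return (number, number)
    return (1, 1)


def toy_classify_type(code: str) -> str:
    return "buffer-overflow"


def toy_explain(code: str, vuln_type: str) -> str:
    return f"Possible {vuln_type}: unbounded copy into a fixed-size buffer."


def toy_detect(code: str) -> Optional[Alert]:
    """Convenience wrapper that wires the toy stages together."""
    return detect(code, toy_is_vulnerable, toy_localize,
                  toy_classify_type, toy_explain)
```

Calling `toy_detect("char buf[8];\nstrcpy(buf, src);")` yields an `Alert` pointing at line 2, while code that fails the binary check returns `None` and skips the later stages. Passing each stage in as a callable keeps the sketch independent of any particular model.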