Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE
Benjamin Steenhoek, Kalpathy Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le
TL;DR
This study examines the gap between the benchmark performance of AI vulnerability detection and fix tools and their practicality in real-world software development. It introduces DeepVulGuard, an IDE-integrated system using CodeBERT for detection and GPT-4 for explanations and fixes, evaluated through a real-world user study with 17 Microsoft developers across 24 projects. The findings reveal promising yet imperfect performance: in its current form, the tool suffers from a high false-positive rate, fixes that are not customized to the user's codebase, and workflow disruption, but it shows clear value in explanations, chat interactions, and confidence-based prioritization. The work offers actionable recommendations for evaluating and deploying AI-driven vulnerability tools in practice and contributes data and code to support future user studies. The results underscore the importance of contextual awareness, workflow integration, and consistent AI outputs for real-world adoption.
Abstract
This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DeepVulGuard scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, and provides natural-language explanations for alerts and fixes through a chat interface. We recruited 17 professional software developers at Microsoft, observed their usage of the tool on their code, and conducted interviews to assess the tool's usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users' perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user's codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at https://doi.org/10.6084/m9.figshare.26367139.
