
Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

Akshay Mhatre, Noujoud Nader, Patrick Diehl, Deepti Gupta

TL;DR

This paper systematically evaluates three leading LLMs (ChatGPT-4, Claude 3, and LLaMA 4) on their ability to detect bugs in three categories: beginner C++ issues, classic security vulnerabilities, and advanced real-world bugs in C++, Python, and OpenSSL-derived code. The benchmark draws on SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, and is paired with a novel multi-stage, context-aware prompting protocol and a graded rubric. The study finds that LLMs excel at basic, well-scoped bugs but struggle with complex security vulnerabilities and production-scale code. ChatGPT-4 and Claude 3 generally deliver richer reasoning and remediation guidance than LLaMA 4. The findings point to substantial value for education and first-pass auditing, while highlighting the need for deeper contextual understanding and safer, more precise patch synthesis in production environments.

Abstract

Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software development, supporting tasks from code generation to debugging. Yet their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs as reliable code analysis tools.

Paper Structure

This paper contains 36 sections, 6 tables.