
Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

Hokyung Lee, Sumanyu Sharma, Bing Hu

TL;DR

Bug In The Code Stack (BICS) is a code-focused benchmark that evaluates how well LLMs detect simple syntactic bugs in large Python code stacks. The benchmark assembles haystacks from the Python Alpaca dataset, injects bugs of seven types at varying depths, and requires the model to report the bug's line number and type. The evaluation covers 11 LLMs across a range of context lengths. The results show substantial performance disparities across models, retrieval that is markedly harder in code than in text-based benchmarks, and performance degradation at longer context lengths whose extent varies by model. The work gives practitioners guidance on model selection for coding tasks and identifies directions for improving LLM capabilities in software engineering, including planned extensions to runtime errors and to languages such as C++ and Rust.

Abstract

Recent research on Needle-in-a-Haystack (NIAH) benchmarks has explored the ability of Large Language Models (LLMs) to retrieve contextual information from large text documents. However, as LLMs become increasingly integrated into software development processes, it is crucial to evaluate their performance in code-based environments. As LLMs are further developed for program synthesis, we need to ensure that they understand syntax and write syntactically correct code. As a step toward that goal, LLMs can be evaluated on their ability to find syntax bugs. Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large bodies of source code. Our findings reveal three key insights: (1) code-based environments are significantly more challenging than text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation, though the extent of this degradation varies between models.
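
The benchmark construction summarized in the TL;DR (assemble a haystack of code, inject a bug at a chosen depth, ask for the line and type) can be illustrated with a minimal sketch. This is not the authors' implementation: the snippets stand in for Python Alpaca samples, the two bug types and their names are illustrative rather than the paper's seven, and the helper names (`build_haystack`, `inject_bug`, `score`) and the exact-match scoring rule are assumptions.

```python
import ast
import random

# Hypothetical stand-ins for Python Alpaca samples (not the real dataset).
SNIPPETS = [
    "def add(a, b):\n    return a + b\n",
    "def greet(name):\n    print('Hello, ' + name)\n",
    "def square(x):\n    return x * x\n",
    "def is_even(n):\n    return n % 2 == 0\n",
]


def build_haystack(num_snippets, rng):
    """Concatenate randomly chosen snippets into one list of source lines."""
    lines = []
    for _ in range(num_snippets):
        lines.extend(rng.choice(SNIPPETS).splitlines())
    return lines


def drop_colon(line):
    """Remove the trailing ':' from a 'def' line."""
    return line.rstrip()[:-1]


def drop_paren(line):
    """Remove the last ')' from a 'def' line."""
    idx = line.rfind(")")
    return line[:idx] + line[idx + 1:]


# Illustrative bug types only; the paper's seven types are not listed here.
BUGS = {"missing_colon": drop_colon, "missing_parenthesis": drop_paren}


def inject_bug(lines, depth, bug_type):
    """Mutate the 'def' line closest to a fractional depth in [0, 1].

    Returns the mutated lines and the 1-indexed line number of the bug.
    """
    target = int(depth * (len(lines) - 1))
    def_lines = [i for i, text in enumerate(lines) if text.startswith("def ")]
    idx = min(def_lines, key=lambda i: abs(i - target))
    mutated = list(lines)
    mutated[idx] = BUGS[bug_type](mutated[idx])
    return mutated, idx + 1


def score(prediction, answer):
    """Assumed scoring rule: both line number and bug type must match."""
    return prediction == answer


if __name__ == "__main__":
    rng = random.Random(0)
    haystack = build_haystack(10, rng)
    buggy, line_no = inject_bug(haystack, depth=0.5, bug_type="missing_colon")
    try:
        ast.parse("\n".join(buggy))
    except SyntaxError:
        print(f"bug injected at line {line_no}: {buggy[line_no - 1]!r}")
```

In a real run, the mutated haystack would be sent to the model with a prompt asking for the line number and bug type, and the parsed reply would be passed to `score` together with the ground-truth pair. Varying `num_snippets` changes the context length and varying `depth` changes where the bug sits in the haystack.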

Paper Structure

This paper contains 8 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: BICS results for all 11 selected models: (1) GPT-4o, (2) GPT-4-Turbo, (3) Claude-3.5 Sonnet, (4) Claude-3 Opus, (5) Gemini 1.5 Pro, (6) Gemini 1.5 Flash, (7) GPT-3.5 Turbo, (8) Codestral, (9) Llama3 70B, (10) Command-r+, and (11) Gemini 1.0 Pro.
  • Figure 2: Comparison of GPT-3.5-Turbo retrieval accuracy on BICS versus the BABILong benchmark.