REDO: Execution-Free Runtime Error Detection for COding Agents

Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler

TL;DR

REDO integrates LLMs with static analysis tools to detect runtime errors in code produced by coding agents, without executing that code, and it outperforms current state-of-the-art methods.

Abstract

As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods, achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.

Paper Structure

This paper contains 34 sections, 13 figures, 6 tables, 2 algorithms.

Figures (13)

  • Figure 1: REDO outperforms current SOTA methods (Bieber et al., 2022) with respect to accuracy and weighted F1 (W.F1) on different tasks.
  • Figure 2: REDO employs a two-step process. Given a repository-level modification, REDO first applies differential analysis to compare the original and modified implementations. If runtime errors are identified, REDO raises an alert and rejects the patch. If no errors are detected, the patch is forwarded to the LLM-based detection step, whose output determines the final verdict on the patch’s safety.
  • Figure 3: Confusion Matrices on CodeStory
  • Figure 4: Successful example
  • Figure 5: Failed example
  • ...and 8 more figures
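
The two-step process described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Verdict` and `redo_style_check` are hypothetical, the static analysis results are assumed to be precomputed sets of error identifiers, and treating "differential analysis" as a set difference between the two reports is an assumption made here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass(frozen=True)
class Verdict:
    """Outcome of the check (hypothetical type, not from the paper)."""
    safe: bool
    stage: str  # which step decided: "differential" or "llm"


def redo_style_check(
    original_errors: Set[str],
    modified_errors: Set[str],
    llm_detects_error: Callable[[], bool],
) -> Verdict:
    """Two-step check loosely following the Figure 2 caption.

    original_errors / modified_errors: error identifiers a static
    analysis tool reported before and after the patch (assumed inputs).
    llm_detects_error: callback standing in for the LLM-based step;
    it returns True when the LLM predicts a runtime error.
    """
    # Step 1 (assumption): keep only errors introduced by the patch.
    introduced = modified_errors - original_errors
    if introduced:
        # Reject immediately; the LLM step is never consulted.
        return Verdict(safe=False, stage="differential")

    # Step 2: no new static errors, so the LLM makes the final call.
    return Verdict(safe=not llm_detects_error(), stage="llm")
```

In this sketch a patch that introduces a new static error is rejected without calling the LLM, while a patch that passes the first step is accepted or rejected purely on the callback's answer, which mirrors the order of decisions in the caption.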