Leveraging Print Debugging to Improve Code Generation in Large Language Models

Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, Fei Wu

TL;DR

The paper tackles the challenge of generating correct code for problems with complex data structures by introducing a print-debugging in-context learning loop for LLMs. It guides models to insert print statements and analyze the resulting logs to identify and fix bugs, using failed test cases and execution traces as interpretable feedback. Evaluated on a LeetCode dataset with GPT-4, the approach outperforms rubber duck debugging on easy and medium problems (by 1.5% and 17.9%, respectively), while hard problems remain resistant to improvement. Ablation and case-study analyses indicate that combining test-case explanations with execution logs is key to effective debugging, suggesting a path toward more robust, log-informed code generation in LLMs.

Abstract

Large language models (LLMs) have made significant progress in code generation tasks, but their performance on programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose an in-context learning approach that guides LLMs to debug using a "print debugging" method, which involves inserting print statements to trace execution and analysing the resulting logs to fix the bug. We collect a LeetCode problem dataset and evaluate our method using the LeetCode online judging system. Experiments with GPT-4 demonstrate the effectiveness of our approach, which outperforms rubber duck debugging on easy and medium-level LeetCode problems by 1.5% and 17.9%, respectively.
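As a rough illustration of what "print debugging" means here, the sketch below adds print statements to a buggy function, runs it on a failing test case, and captures the log that a model would then read alongside the expected answer. This is a hypothetical toy example, not taken from the paper's dataset or prompts: the function `max_subarray_sum_buggy`, the helper `run_with_log`, and the test case are all assumptions made for illustration.

```python
import contextlib
import io


def max_subarray_sum_buggy(nums):
    """Hypothetical buggy solution: maximum sum of a contiguous subarray.

    Bug: `best` starts at 0, so an all-negative input wrongly returns 0.
    The print statement is the only addition; the logic is left untouched.
    """
    best = 0  # buggy initialisation (should start from nums[0])
    current = 0
    for i, x in enumerate(nums):
        current = max(x, current + x)
        best = max(best, current)
        print(f"i={i} x={x} current={current} best={best}")  # added trace
    return best


def run_with_log(func, *args):
    """Run `func` and return (result, captured print output)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args)
    return result, buffer.getvalue()


# Failed test case (hypothetical): input [-3, -1, -2], expected answer -1.
result, log = run_with_log(max_subarray_sum_buggy, [-3, -1, -2])
```

In this toy case the log shows `best` staying at 0 while `current` reaches -1, which contradicts the expected answer of -1 and points at the initialisation of `best`.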

Paper Structure (15 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparative Workflow: Print Debugging vs. Rubber Duck Debugging. From top to bottom, LLMs generate code for a LeetCode problem and then submit it for testing on the LeetCode online judging system. If not all test cases pass, LLMs proceed to the debugging procedure. For rubber duck debugging (bottom left), LLMs explain the code line by line and then fix the bug according to the explanation. For print debugging (bottom right), LLMs insert print statements, obtain the log, and debug according to an explanation of the test case and the log.
  • Figure 2: Illustration of adding print statements to the buggy code. LLMs are prompted to add several print statements to the buggy code without changing the rest of the code. Red: buggy code, Blue: added print statements. For the sake of brevity, we omit demonstrations and some instructions.
  • Figure 3: Illustration of analysing and fixing the bug. Bold: the failed test case with the wrong answer and/or expected answer, Yellow: the log, Purple: the rationales indicating that the LLM found the bug through the log, Red: the identified buggy code, which is also marked in red in Figure 2, Green: the corrected code. The inconsistencies between the test case explanation and the log explanation are highlighted with a background color.
  • Figure 4: Performance of different debugging methods as the procedure progresses.
  • Figure 5: Number of added print statements.
  • ...and 1 more figures
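
The print-debugging branch of the workflow described in the Figure 1 caption can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate`, `judge`, `add_prints`, `run`, and `fix` are hypothetical callables standing in for the LLM prompts and the online judge, and the `max_rounds` limit is an assumption rather than a value from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class JudgeResult:
    passed: bool
    failed_case: Optional[str] = None  # description of the first failing test


def print_debug_loop(
    problem: str,
    generate: Callable[[str], str],          # hypothetical: LLM writes code
    judge: Callable[[str], JudgeResult],     # hypothetical: online judge
    add_prints: Callable[[str, str], str],   # hypothetical: LLM adds prints
    run: Callable[[str, str], str],          # hypothetical: execute, return log
    fix: Callable[[str, str, str], str],     # hypothetical: LLM repairs code
    max_rounds: int = 3,                     # assumed limit, not from the paper
) -> Tuple[str, bool]:
    """Generate code, then debug with print logs until the judge accepts."""
    code = generate(problem)
    for _ in range(max_rounds):
        result = judge(code)
        if result.passed:
            return code, True
        # Instrument the buggy code, run it on the failed test case,
        # and hand the test case plus the log back for a fix.
        instrumented = add_prints(code, result.failed_case)
        log = run(instrumented, result.failed_case)
        code = fix(code, result.failed_case, log)
    return code, judge(code).passed
```

Keeping each step as a separate callable mirrors the caption's stages (generate, test, insert prints, collect the log, fix) and lets a rubber duck variant be substituted by swapping the instrumentation and log steps for a line-by-line explanation.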