An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation

Thai Tang Quoc; Duc Ha Minh; Tho Quan Thanh; Anh Nguyen-Duc

TL;DR

CoT-SelfEvolve iteratively and automatically refines code through a self-correcting process, guided by a chain of thought constructed from real-world programming problem feedback, providing a practical solution for improving LLM-based code generation.

Abstract

Large Language Models (LLMs) have recently advanced many software engineering applications, most notably code generation. A persistent challenge is that LLM-generated code often suffers from inaccuracies and hallucinations and requires external input to correct. One recent strategy for addressing these issues is to refine the generated code using input from the model itself (self-augmentation). In this work, we propose a novel method, CoT-SelfEvolve, which iteratively and automatically refines code through a self-correcting process guided by a chain of thought constructed from real-world programming problem feedback. Focusing on data science code, including Python libraries such as NumPy and Pandas, our evaluations on the DS-1000 dataset demonstrate that CoT-SelfEvolve significantly outperforms existing models in solving complex problems. The framework shows substantial improvements in both initial code generation and subsequent iterations, with the model's accuracy increasing significantly with each additional iteration. This highlights the effectiveness of using chain-of-thought prompting to address complexities revealed by program executor traceback error messages. We also discuss how CoT-SelfEvolve can be integrated into continuous software engineering environments, providing a practical solution for improving LLM-based code generation.
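The self-correcting loop described in the abstract can be illustrated with a minimal sketch. This is not the CoT-SelfEvolve implementation: the names `run_candidate`, `self_correct`, and `toy_generate` are hypothetical, the LLM is replaced by a stub function, and the chain-of-thought construction and external knowledge base are not modeled. The sketch only shows the general pattern of executing a candidate, capturing the executor traceback, and feeding it back into the next generation attempt, up to an assumed attempt limit.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_candidate(code: str) -> Optional[str]:
    """Execute candidate code; return None on success, else the traceback text."""
    try:
        exec(code, {})  # fresh namespace per run; illustration only, not a sandbox
        return None
    except Exception:
        return traceback.format_exc()


def self_correct(
    problem: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 5,
) -> Tuple[str, int, bool]:
    """Regenerate code until it runs without error or attempts are exhausted.

    `generate` stands in for an LLM call (hypothetical interface): it receives
    the problem text and the previous traceback (None on the first attempt).
    Returns (last candidate, attempts used, whether it ran cleanly).
    """
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(problem, feedback)
        feedback = run_candidate(code)
        if feedback is None:
            return code, attempt, True
    return code, max_attempts, False


def toy_generate(problem: str, feedback: Optional[str]) -> str:
    """Stub generator: the first draft is buggy, the revision after feedback works."""
    if feedback is None:
        return "result = [1, 2, 3].mean()"  # lists have no .mean() method
    return "result = sum([1, 2, 3]) / 3"


code, attempts, solved = self_correct("mean of a list", toy_generate)
```

In this toy run the first candidate fails with an `AttributeError`, and the stub returns a working revision once it receives the traceback. A real setup would replace `toy_generate` with a model call and would judge success with test cases rather than the mere absence of an exception.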

Paper Structure (30 sections, 7 figures, 3 tables)

Introduction
Related Work
LLMs and Automated Program Repair
Improving the performance of LLMs
Via Learning from Human Feedback
Via Learning with Automated Feedback
Self-correcting LLMs
Self-correcting LLMs in APR
Proposed approach
Evaluations
Experimental settings
Benchmark data
External Knowledge Base
Evaluation Metrics
Results
...and 15 more sections

Figures (7)

Figure 1: The architecture of CoT-SelfEvolve framework
Figure 2: (RQ1) Comparing performance results for SelfEvolve and CoT-SelfEvolve across different libraries.
Figure 3: (RQ1) DS-1000 average performance across various LLMs (%).
Figure 4: (RQ3) Cumulative number of problems reaching the stop condition at different attempts, where $n=5$.
Figure 5: (RQ3) Number of prompt tokens and completion tokens for different max attempts $n$.
...and 2 more figures
