PDC & DM-SFT: A Road for LLM SQL Bug-Fix Enhancing

Yiwen Duan, Yonghong Yu, Xiaoming Zhao, Yichang Wu, Wenbo Liu

TL;DR

A suite of methods for enhancing LLMs' SQL bug-fixing abilities, built around an efficient supervised learning approach for bug fixing that reduces the total number of training steps and mitigates "disorientation" in SQL bug-fix training.

Abstract

Code Large Language Models (Code LLMs), such as Code Llama and DeepSeek-Coder, have demonstrated exceptional performance on code generation tasks. However, most existing models focus on generating correct code and often struggle with bug repair. We introduce a suite of methods to enhance LLMs' SQL bug-fixing abilities. The methods consist mainly of two parts: Progressive Dataset Construction (PDC) from scratch and Dynamic Mask Supervised Fine-tuning (DM-SFT). PDC proposes two data expansion methods, one breadth-first and one depth-first. DM-SFT introduces an efficient supervised learning approach for bug fixing, which reduces the total number of training steps and mitigates "disorientation" in SQL bug-fix training. In our evaluation, code LLMs trained with the two methods outperform all of the current best-performing models, even though those models are much larger.

Paper Structure

This paper contains 16 sections, 5 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The initial training data collection by mining user behavior logs.
  • Figure 2: Execution filter for data quality.
  • Figure 3: Overview of oriented generation method.
  • Figure 4: A comparison of the default generative SFT (top) and dynamic mask SFT (bottom) for the code bug-fixing task.
  • Figure 5: Bug-fixing evaluation results with different values of the random mask ratio factor $p$.
  • ...and 7 more figures