FGIT: Fault-Guided Fine-Tuning for Code Generation

Lishui Fan, Zhongxin Liu, Haoye Wang, Lingfeng Bao, Xin Xia, Shanping Li

TL;DR

This paper addresses a gap in standard instruction tuning: large language models can produce plausible but incorrect code. It proposes Fault-Guided Fine-Tuning (FGIT), a two-part approach. First, it identifies error-sensitive segments by using a teacher model to generate and select similar incorrect variants, then annotating the multi-granularity differences between those variants and the correct implementation. Second, it applies dynamic loss reweighting to emphasize these segments during training. Across seven LLMs and three code-generation benchmarks, FGIT achieves an average relative improvement of 6.9% on pass@1 and demonstrates strong generalization to both open and closed-source instruction data, with ablations confirming the value of multi-granularity differences and dynamic weighting. The method enhances the model’s ability to distinguish correct implementations from near-miss errors, offering a practical path to higher-quality, functionally correct code generation in diverse settings.

Abstract

Modern instruction-tuned large language models (LLMs) have made remarkable progress in code generation. However, LLMs fine-tuned with standard supervised fine-tuning (SFT) sometimes generate plausible-looking but functionally incorrect code variants. This issue likely stems from a limitation of standard SFT, which treats all tokens equally during optimization and fails to emphasize the error-sensitive segments, that is, the specific code differences between correct implementations and similar incorrect variants. To address this problem, we propose Fault-Guided Fine-Tuning (FGIT), a novel fine-tuning technique that enhances LLMs' code generation by (1) extracting multi-granularity (line/token-level) differences between correct and incorrect yet similar implementations to identify error-sensitive segments, and (2) prioritizing those segments during training via dynamic loss weighting. Through extensive experiments on seven LLMs across three widely used benchmarks, our method achieves an average relative improvement of 6.9% on pass@1, with some enhanced 6.7B LLMs outperforming closed-source models, e.g., GPT-3.5-Turbo. Furthermore, our fine-tuning technique demonstrates strong generalization, with performance improvements ranging from 3.8% to 19.1% across diverse instruction-tuned LLMs, and our ablation studies confirm the contributions of different granularities of differences and hyperparameters.
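
To make the two steps in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a crude regex tokenizer in place of a real LLM tokenizer, uses Python's difflib to find line-level and then token-level differences, and uses a fixed up-weighting factor `alpha` as a stand-in for the paper's dynamic weighting, whose exact form is not given in this summary. All function and variable names are hypothetical.

```python
import difflib
import re

def tokenize(line):
    # Hypothetical stand-in for an LLM tokenizer: words/numbers or single symbols.
    return re.findall(r"\w+|[^\w\s]", line)

def error_sensitive_mask(correct, incorrect):
    """Return (tokens, mask) for the correct code.

    mask[i] is True when token i differs from the similar incorrect variant.
    Lines are compared first; changed lines are then refined at token level.
    """
    c_lines, w_lines = correct.splitlines(), incorrect.splitlines()
    tokens, mask = [], []
    line_sm = difflib.SequenceMatcher(a=c_lines, b=w_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in line_sm.get_opcodes():
        c_toks = [t for line in c_lines[i1:i2] for t in tokenize(line)]
        if tag == "equal":
            tokens += c_toks
            mask += [False] * len(c_toks)
        elif tag == "delete":
            # Lines present only in the correct code: mark every token.
            tokens += c_toks
            mask += [True] * len(c_toks)
        elif tag == "replace":
            # Token-level refinement inside a changed line span.
            w_toks = [t for line in w_lines[j1:j2] for t in tokenize(line)]
            tok_sm = difflib.SequenceMatcher(a=c_toks, b=w_toks, autojunk=False)
            for ttag, a1, a2, _, _ in tok_sm.get_opcodes():
                tokens += c_toks[a1:a2]
                mask += [ttag != "equal"] * (a2 - a1)
        # tag == "insert": lines only in the incorrect variant; nothing to mark.
    return tokens, mask

def weighted_nll(token_logprobs, mask, alpha=2.0):
    """Weighted mean negative log-likelihood over the correct code's tokens.

    Assumption: error-sensitive tokens get a fixed weight alpha and all other
    tokens weight 1.0; the paper's dynamic scheme is not reproduced here.
    """
    weights = [alpha if m else 1.0 for m in mask]
    total = sum(-w * lp for w, lp in zip(weights, token_logprobs))
    return total / sum(weights)

CORRECT = "def is_even(n):\n    return n % 2 == 0"
INCORRECT = "def is_even(n):\n    return n % 2 == 1"

if __name__ == "__main__":
    toks, msk = error_sensitive_mask(CORRECT, INCORRECT)
    print([t for t, m in zip(toks, msk) if m])
```

In this toy pair the two implementations differ in a single literal, so only that token receives the larger weight; a model that assigns low probability to it is penalized more than under a uniform token average.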

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Llama-3.1-70B-Instruct sometimes makes mistakes in error-sensitive segments in the outputs.
  • Figure 2: An overview of FGIT, using one sample for explanation.
  • Figure 3: The prompt for generating a similar yet incorrect response.
  • Figure 4: A case demonstrating how LLMs trained with FGIT can better focus on error-sensitive segments to generate the correct solution.
  • Figure 5: A case demonstrating how FGIT can improve overall code generation performance.
  • ...and 3 more figures