Code Comparison Tuning for Code Large Language Models

Yufan Jiang; Qiaozhi He; Xiaomin Zhuang; Zhihua Wu

TL;DR

Code Comparison Tuning (CCT) addresses the limited sensitivity of Code LLMs to subtle code errors by integrating token-level and sequence-level comparisons into instruction tuning. It constructs erroneous code variants and applies a token-level preference loss together with sequence-level comparison templates, combining both with the standard instruction-tuning objective as $\mathcal{L} = \mathcal{L}_{lm} + \alpha \mathcal{L}_{token} + \beta \mathcal{L}_{seq}$, where $\alpha=2.0$ and $\beta=0.5$. Evaluations on the HumanEvalFix benchmark show consistent improvements over standard instruction tuning across open-source backbones, with pass@1 gains of up to about 4 points. The results indicate that explicit comparison signals improve the model's ability to detect and repair bugs; the method is also data-efficient, and ablations support the contribution of both the token-level and the sequence-level signal.
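As a rough illustration of how the three terms are weighted, the following sketch combines already-computed scalar losses using the coefficients stated above. The function name `combined_loss` and the example loss values are assumptions made for illustration; how each individual term is computed is not specified in this summary and is not modeled here.

```python
# Minimal sketch of the weighted objective L = L_lm + alpha * L_token + beta * L_seq.
# Only the weights (alpha = 2.0, beta = 0.5) come from the text; the function name
# and the inputs are hypothetical placeholders for losses computed elsewhere.

ALPHA = 2.0  # weight on the token-level comparison term
BETA = 0.5   # weight on the sequence-level comparison term


def combined_loss(lm_loss: float, token_loss: float, seq_loss: float,
                  alpha: float = ALPHA, beta: float = BETA) -> float:
    """Return the weighted sum of the three scalar loss terms."""
    return lm_loss + alpha * token_loss + beta * seq_loss


if __name__ == "__main__":
    # Made-up loss values, purely to show the arithmetic.
    print(combined_loss(lm_loss=1.0, token_loss=0.5, seq_loss=2.0))
```

With these weights, the token-level term counts four times as much per unit of loss as the sequence-level term.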

Abstract

We present Code Comparison Tuning (CCT), a simple and effective tuning method for code large language models (Code LLMs) to better handle subtle code errors. Specifically, we integrate the concept of comparison into instruction tuning, both at the token and sequence levels, enabling the model to discern even the slightest deviations in code. To compare the original code with an erroneous version containing manually added code errors, we use token-level preference loss for detailed token-level comparisons. Additionally, we combine code segments to create a new instruction tuning sample for sequence-level comparisons, enhancing the model's bug-fixing capability. Experimental results on the HumanEvalFix benchmark show that CCT surpasses instruction tuning in pass@1 scores by up to 4 points across diverse code LLMs, and extensive analysis demonstrates the effectiveness of our method.
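The data construction described in the abstract might look roughly like the sketch below: take correct code, inject a small error, and pair the two versions. Everything here is a hypothetical illustration, not the paper's procedure. The operator swap stands in for one kind of injected bug, the prompt wording in `build_fix_sample` is invented, and `differing_positions` uses whitespace splitting in place of a real tokenizer.

```python
# Hypothetical sketch: build a (buggy, correct) pair for comparison-style tuning.
# The bug type, template wording, and whitespace tokenization are all assumptions.

def inject_operator_misuse(code: str) -> str:
    """Swap the first '<' for '>' to create a subtle operator-misuse bug."""
    return code.replace("<", ">", 1)


def build_fix_sample(correct: str, buggy: str) -> dict:
    """Pair buggy code with its fix as one instruction-tuning sample."""
    prompt = "Fix the bug in the following code:\n" + buggy  # invented template
    return {"instruction": prompt, "response": correct}


def differing_positions(correct: str, buggy: str) -> list:
    """Indices where whitespace-split tokens differ between the two versions."""
    pairs = zip(correct.split(), buggy.split())
    return [i for i, (a, b) in enumerate(pairs) if a != b]


if __name__ == "__main__":
    correct = "def is_small(x):\n    return x < 10\n"
    buggy = inject_operator_misuse(correct)
    sample = build_fix_sample(correct, buggy)
    print(sample["instruction"])
    print(differing_positions(correct, buggy))
```

The differing positions mark where the two versions diverge, which is the kind of location a token-level comparison would need to distinguish.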

Paper Structure (17 sections, 3 equations, 6 figures, 4 tables)

Introduction
Method
Background: Instruction Tuning
Code Comparison Tuning
Token-level Comparison
Sequence-level Comparison
Overall Training Objective
Experiments
Datasets
Baselines & Settings
Results
Ablation Study
Results on HumanEvalFixDocs
Effect of Corpus Size
Conclusions
...and 2 more sections

Figures (6)

Figure 1: An erroneous bug fix example. Given the code-related issues, users or code language models generate code with bugs. The fine-tuned models tend to introduce additional errors when attempting to fix bugs (red).
Figure 2: The overall framework of our proposed CCT.
Figure 3: Effect of instruction dataset size. We report pass@1 for different instruction dataset sizes.
Figure 4: Variable misuse bug example. The buggy code (right) incorrectly uses 'newcode'.
Figure 5: Operator misuse bug example. The buggy code (right) incorrectly uses 'greater than'.
...and 1 more figures
