Code Comparison Tuning for Code Large Language Models
Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu
TL;DR
Code Comparison Tuning (CCT) targets the weak sensitivity of Code LLMs to subtle code errors by integrating token-level and sequence-level comparisons into instruction tuning. It constructs erroneous variants of correct code, applies a token-level preference loss between each original and its erroneous variant, and builds sequence-level comparison samples from templates. These signals are blended with the standard instruction-tuning objective via $\mathcal{L} = \mathcal{L}_{lm} + \alpha \mathcal{L}_{token} + \beta \mathcal{L}_{seq}$, where $\alpha=2.0$ and $\beta=0.5$. On the HumanEvalFix benchmark, CCT consistently improves over standard instruction tuning across open-source backbones, with pass@1 gains of up to about 4 points. The results indicate that explicit comparison signals improve the model's ability to detect and repair bugs; CCT is also data-efficient, and ablations support the contribution of both the token-level and sequence-level signals.
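As a minimal sketch of how the three terms in the objective above could be blended, the following Python function combines already-computed scalar losses using the stated weights. The function name `cct_loss` and its argument names are assumptions made for illustration; how each individual term is computed is not shown here.

```python
def cct_loss(lm_loss: float, token_loss: float, seq_loss: float,
             alpha: float = 2.0, beta: float = 0.5) -> float:
    """Blend the three training signals: L = L_lm + alpha * L_token + beta * L_seq.

    lm_loss    -- standard instruction-tuning (language modeling) loss
    token_loss -- token-level preference loss
    seq_loss   -- loss on the sequence-level comparison samples
    The defaults alpha=2.0 and beta=0.5 are the weights stated in the text;
    the function name and signature are hypothetical.
    """
    return lm_loss + alpha * token_loss + beta * seq_loss


# Illustrative values only, not results from the paper.
total = cct_loss(lm_loss=1.0, token_loss=0.5, seq_loss=0.2)
```

With these illustrative inputs the token-level term contributes 1.0 and the sequence-level term 0.1, so the larger $\alpha$ gives the token-level comparison more influence per unit of loss.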
Abstract
We present Code Comparison Tuning (CCT), a simple and effective tuning method for code large language models (Code LLMs) to better handle subtle code errors. Specifically, we integrate the concept of comparison into instruction tuning, both at the token and sequence levels, enabling the model to discern even the slightest deviations in code. To compare the original code with an erroneous version containing manually added code errors, we use token-level preference loss for detailed token-level comparisons. Additionally, we combine code segments to create a new instruction tuning sample for sequence-level comparisons, enhancing the model's bug-fixing capability. Experimental results on the HumanEvalFix benchmark show that CCT surpasses instruction tuning in pass@1 scores by up to 4 points across diverse code LLMs, and extensive analysis demonstrates the effectiveness of our method.
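The two comparison levels described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the whitespace tokenization, the hinge-style margin form of the token-level loss, the per-token scores, and the instruction template text are all hypothetical choices, since the abstract does not specify them.

```python
from typing import Dict, List


def first_divergence(correct: List[str], buggy: List[str]) -> int:
    """Index of the first token where the two token sequences differ."""
    for i, (a, b) in enumerate(zip(correct, buggy)):
        if a != b:
            return i
    return min(len(correct), len(buggy))


def token_preference_loss(pos_scores: List[float], neg_scores: List[float],
                          start: int, margin: float = 1.0) -> float:
    """Assumed hinge-style preference loss, averaged from `start` onward.

    Each position is penalized when the score for the correct code does not
    exceed the score for the erroneous code by at least `margin`.
    """
    pairs = list(zip(pos_scores[start:], neg_scores[start:]))
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)


def make_sequence_sample(correct_code: str, buggy_code: str) -> Dict[str, str]:
    """Combine both code segments into one instruction-tuning sample.

    The instruction wording is a hypothetical template.
    """
    return {
        "instruction": "Fix the bug in the following code.",
        "input": buggy_code,
        "output": correct_code,
    }


# Build an erroneous variant by flipping one operator (a hand-made bug).
correct = "def add(a, b):\n    return a + b"
buggy = correct.replace("+", "-", 1)

start = first_divergence(correct.split(), buggy.split())
sample = make_sequence_sample(correct, buggy)
```

Starting the token-level comparison at the first diverging token is a design choice in this sketch: the shared prefix is identical in both versions, so it carries no preference signal.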
