AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

TL;DR

AlchemistCoder tackles the limited diversity of fine-tuning data for open-source Code LLMs by integrating multi-source datasets and harmonizing them with data-specific AlchemistPrompts guided by hindsight relabeling. It further enriches training with code comprehension tasks (instruction evolution, data filtering, code review) and rigorous data cleaning to produce a harmonized AlchemistCoder dataset (~200M tokens). The approach yields strong performance on code-generation benchmarks, outperforming size-matched baselines and rivaling larger models, while improving generalization on MMLU, BBH, and GSM8K. This demonstrates a practical, cost-effective path to more capable and generalist Code LLMs, although it relies on GPT-4 for prompt design, which motivates future work on open-source prompt generation and broader data sources.

Abstract

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may fail to fully elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities, fine-tuned on multi-source data. To achieve this, we are the first to reveal inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.
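
The harmonization idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Sample` type, the source labels, and the `describe_response` heuristic are all hypothetical, and the heuristic merely stands in for the GPT-4-generated AlchemistPrompts mentioned in the TL;DR. The sketch only shows the hindsight-relabeling pattern: the response is left untouched, and the instruction is extended with a data-specific prompt derived from that response, so that two sources answering the same instruction in different styles no longer conflict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source: str        # hypothetical label for the originating dataset
    instruction: str
    response: str


def describe_response(sample: Sample) -> str:
    """Hypothetical stand-in for an LLM-written, data-specific prompt.

    A toy heuristic inspects the existing response and describes its style,
    so the description is produced in hindsight, after the response is known.
    """
    if "#" in sample.response:
        style = "with explanatory comments"
    else:
        style = "as concise code without comments"
    return f"Answer in the style of the {sample.source} data, {style}."


def harmonize(sample: Sample) -> Sample:
    """Relabel the instruction in hindsight; the response stays unchanged."""
    prompt = describe_response(sample)
    return Sample(sample.source, f"{sample.instruction}\n{prompt}", sample.response)


# Two hypothetical sources answer the same instruction in different styles.
commented = Sample(
    "source_a",
    "Reverse a string.",
    "def rev(s):\n    # slice backwards\n    return s[::-1]",
)
terse = Sample("source_b", "Reverse a string.", "def rev(s): return s[::-1]")

harmonized = [harmonize(commented), harmonize(terse)]
```

After relabeling, the two instructions differ, so each one is consistent with its own response, while the training targets themselves are unchanged.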

Paper Structure (30 sections, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Performance scatter plot (top right is better) of open-source models on the mainstream code benchmarks HumanEval and MBPP. Our AlchemistCoder series stands out among all open-source Code LLMs.
  • Figure 2: Overview of the development of the AlchemistCoder series. We first integrate high-quality open-source data (a) and conduct data evolution based on it (b). Then, we adopt AlchemistPrompts to harmonize the inherent conflicts (c) and construct code comprehension data (d). We fine-tune various pre-trained LLMs on a mix of these data to obtain our AlchemistCoder models.
  • Figure 3: Examples of conflicts (e.g., various styles and quality) within multi-source code corpora.
  • Figure 4: Detailed prompt designed for generating data-specific AlchemistPrompts.
  • Figure 5: Data distribution analysis of our AlchemistCoder dataset. The outer and inner circular diagrams respectively display the distributions of data composition and programming languages. Data from AlchemistPrompts and code comprehension tasks, constituting only 8% of the total data, plays a crucial role in harmonizing and polishing the fine-tuning data.
  • ...and 17 more figures