WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models

Huawen Feng, Pu Zhao, Qingfeng Sun, Can Xu, Fangkai Yang, Lu Wang, Qianli Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

TL;DR

WarriorCoder presents a data-efficient paradigm in which open-source code LLMs compete in expert battles, with impartial judges and an Elo-based global score guiding data collection. By mining instructions from scratch, deduplicating them, filtering them by difficulty, and selecting diverse training data through embedding-based compression and KCenterGreedy, the approach creates high-quality post-training data without seed datasets or proprietary prompts. Fine-tuning on this data yields state-of-the-art results among models of the same size across multiple code benchmarks (e.g., HumanEval and MBPP), along with strong performance on code reasoning and library usage, surpassing several proprietary baselines. Ablation and data analyses show that the mined data are largely novel, diverse, and spread across task types, supporting WarriorCoder’s claim of scalable, low-cost data flywheels and potential applicability beyond coding tasks.
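The diversity-selection step named above can be illustrated with the generic greedy k-center heuristic. This is a minimal sketch, not the authors' implementation: the function name `k_center_greedy`, the use of Euclidean distance, and the choice of the first center are assumptions made for illustration, and the embeddings are toy 2-D points.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Return indices of k points chosen by the greedy k-center heuristic.

    Assumption: Euclidean distance over embedding vectors; the summary
    does not specify the metric or the starting point.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from every center chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-center distances with the newly added center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy embeddings: two tight pairs and one outlier on a line.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
picked = k_center_greedy(toy, k=3)
```

With these toy points, the selection skips near-duplicates and keeps one representative from each region, which is the behavior a diversity filter is meant to have.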

Abstract

Despite recent progress achieved by code large language models (LLMs), their remarkable abilities depend largely on fine-tuning on high-quality data, which poses challenges for data collection and annotation. To address this, current methods often design various data flywheels to collect complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from a limited set of proprietary LLMs (e.g., Claude, GPT4, and so on), which restricts the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose WarriorCoder, a novel paradigm that learns from expert battles to address these limitations. Specifically, we create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges. This competitive framework generates novel training data from scratch, leveraging the strengths of all participants. Experimental results show that WarriorCoder achieves state-of-the-art performance compared to previous models of the same size, even without relying on proprietary LLMs.
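The TL;DR mentions an Elo-based global score for the competing models. As a rough illustration of how pairwise battle outcomes can be turned into ratings, here is the standard Elo update. It is a sketch under stated assumptions: the K-factor of 32, the 400-point scale, and the function names are conventional defaults chosen for illustration, and the paper's exact scoring formula may differ.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k (assumed 32 here) controls how far one result moves a rating.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated models; the first one wins the battle.
winner, loser = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because the update is zero-sum, the total rating across the two participants stays constant after every battle.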

Paper Structure

This paper contains 26 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison between our method and traditional data flywheels. Unlike previous work, we guide the target model to learn from pairwise competitions. With no need for seed datasets, human-generated prompts, or annotations from proprietary models, the target model integrates the strengths of its competitors.
  • Figure 2: The diagram of learning from expert battles. In each round of the arena, the attacker challenges the defender in its area of expertise under the evaluation of judges, and then the winner's response is added to the training data. In this manner, the target model gradually incorporates the strengths of all the code experts by fine-tuning on the data.
  • Figure 3: The overlapping rate between the mined instructions and existing training datasets.
  • Figure 4: The heatmap of win rates of the selected code experts.
  • Figure 5: The proportion of difficulties of mined instructions. As mentioned in Section \ref{sec:instruction}, the difficulties of instructions are divided into four levels: excellent (9-10), good (6-8), average (3-5), and poor (1-2).
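The four difficulty levels in the Figure 5 caption can be expressed as a small bucketing helper. This is a minimal sketch: the score ranges come from the caption, while the function names `difficulty_level` and `level_proportions`, and the assumption that scores are integers from 1 to 10, are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def difficulty_level(score: int) -> str:
    """Map a 1-10 difficulty score to the level named in the Figure 5 caption."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "excellent"  # 9-10
    if score >= 6:
        return "good"       # 6-8
    if score >= 3:
        return "average"    # 3-5
    return "poor"           # 1-2


def level_proportions(scores: Iterable[int]) -> Dict[str, float]:
    """Return the share of instructions that falls into each level."""
    counts = Counter(difficulty_level(s) for s in scores)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Hypothetical scores, used only to exercise the helper.
shares = level_proportions([9, 10, 7, 3])
```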