DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

Yejie Wang, Keqing He, Guanting Dong, Pei Wang, Weihao Zeng, Muxi Diao, Yutao Mou, Mengdi Zhang, Jingang Wang, Xunliang Cai, Weiran Xu

TL;DR

DolphCoder targets two gaps in instruction tuning for code LLMs: limited diversity in the training targets and the lack of a reliable evaluation signal. It introduces Diverse Instruction Tuning (DIT), which augments each task with multiple reasoning paths, and Multi-Objective Instruction Tuning (MOT), which jointly optimizes code generation and code evaluation, with Code Llama-Python as the backbone. On HumanEval, HumanEval+, and MBPP, DolphCoder achieves strong performance among open-source models, with gains from both DIT and MOT and favorable comparisons to the WizardCoder and Code Llama baselines. The results indicate that diverse instruction targets combined with an explicit evaluation objective can improve code correctness and robustness, offering a practical path toward more reliable code LLMs in research settings.

Abstract

Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Several instruction tuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this paper, we introduce DolphCoder, a diverse instruction model with self-evaluation for code generation. It learns diverse instruction targets and adds a code evaluation objective to enhance its code generation ability. Our model achieves superior performance on the HumanEval and MBPP benchmarks, offering new insights for future code instruction tuning work. Our key findings are: (1) Augmenting the training data with more diverse responses that follow distinct reasoning paths increases the code capability of LLMs. (2) Improving a model's ability to evaluate the correctness of code solutions also enhances its ability to create them.
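
The two ideas in the abstract can be illustrated with a minimal data-construction sketch. It is not the authors' pipeline: the system prompt strings, the `Example` record, the `teacher` callable, and the evaluation prompt wording are all hypothetical names introduced here for illustration. The sketch only shows the shape of the data, namely several distinct targets per instruction for DIT and an added correctness-judging example for MOT.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical system prompts (assumption); "" stands for "no system prompt".
SYSTEM_PROMPTS = [
    "",
    "Explain your reasoning step by step, then give the code.",
    "Answer with code only.",
]


@dataclass
class Example:
    prompt: str
    target: str
    objective: str  # "generate" or "evaluate"


def build_generation_examples(
    instruction: str,
    teacher: Callable[[str, str], str],
    system_prompts: List[str] = SYSTEM_PROMPTS,
) -> List[Example]:
    """DIT-style augmentation: query the teacher once per system prompt and
    keep every distinct response as a separate target for the instruction."""
    seen = set()
    examples = []
    for system_prompt in system_prompts:
        response = teacher(system_prompt, instruction)
        if response in seen:  # drop verbatim duplicates
            continue
        seen.add(response)
        examples.append(Example(instruction, response, "generate"))
    return examples


def build_evaluation_example(
    instruction: str, candidate_code: str, is_correct: bool
) -> Example:
    """MOT-style evaluation example: the model is asked to judge a candidate
    solution. The prompt template here is an assumption, not the paper's."""
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Candidate solution:\n{candidate_code}\n\n"
        "Is this solution correct?"
    )
    target = "Correct" if is_correct else "Incorrect"
    return Example(prompt, target, "evaluate")
```

A multi-objective training set would then simply mix records of both `objective` kinds, so that one model is trained to write code and to judge it.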

Paper Structure (22 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The overall architecture of our proposed diverse instruction tuning with self-evaluating for code generation, DolphCoder. Stage (a) denotes Diverse Instruction Tuning (DIT) and Stage (b) denotes Multi-Objective Instruction Tuning (MOT) for self-evaluating.
  • Figure 2: We use these system prompts to generate more diverse responses where [EMPTY] means no system prompt.
  • Figure 3: We use the evaluation prompt to query GPT-4 to assess the correctness of the code solutions generated by our model.
  • Figure 4: Inference prompt when testing on HumanEval and MBPP.
  • Figure 5: The trend of code evaluation capability and code generation capability during the MOT stage, where step 200 refers to the training step in the first step of MOT. Init model means the DIT model and Final means DolphCoder. Pass@1 refers to pass@1 on HumanEval.
  • ...and 1 more figure