MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning

Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen, Hailian Zhou, Hang Yu, Jianguo Li

TL;DR

MFTCoder tackles the inefficiency of fine-tuning Code LLMs separately for each task by introducing a multitask fine-tuning framework that explicitly balances tasks with different data volumes, difficulty levels, and convergence speeds. It combines instruction data generation, efficient tokenization, and parameter-efficient fine-tuning (LoRA/QLoRA) with specialized loss functions to address data imbalance and task heterogeneity. Across five code-oriented tasks and multiple base models, MFT outperforms single-task and mixed-task baselines, with notable results on HumanEval and Text-to-SQL generalization; a CodeLlama-34B variant even surpasses GPT-4 on several benchmarks. The framework is open-sourced as MFTCoder, enabling scalable, efficient fine-tuning of large Code LLMs for broader deployment and research applications.

Abstract

Code LLMs have emerged as a specialized research field, with remarkable studies dedicated to enhancing models' coding capabilities through fine-tuning on pre-trained models. Previous fine-tuning approaches were typically tailored to specific downstream tasks or scenarios, which meant separate fine-tuning for each task, requiring extensive training resources and posing challenges in terms of deployment and maintenance. Furthermore, these approaches failed to leverage the inherent interconnectedness among different code-related tasks. To overcome these limitations, we present a multi-task fine-tuning framework, MFTCoder, that enables simultaneous and parallel fine-tuning on multiple tasks. By incorporating various loss functions, we effectively address common challenges in multi-task learning, such as data imbalance, varying difficulty levels, and inconsistent convergence speeds. Extensive experiments demonstrate that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks. Moreover, MFTCoder offers efficient training capabilities, including efficient data tokenization modes and parameter-efficient fine-tuning (PEFT), resulting in significantly improved speed compared to traditional fine-tuning methods. MFTCoder seamlessly integrates with several mainstream open-source LLMs, such as CodeLlama and Qwen. Built on CodeLlama, our MFTCoder fine-tuned model, CodeFuse-CodeLlama-34B, achieves a pass@1 score of 74.4% on the HumanEval benchmark, surpassing GPT-4 performance (67%, zero-shot). MFTCoder is open-sourced at https://github.com/codefuse-ai/MFTCOder
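
The abstract does not spell out the loss functions it refers to. As a rough illustration of how a loss can counter data imbalance, the sketch below averages the token loss within each task first and then averages across tasks, so that a task with many tokens does not dominate a small one. The function name `task_balanced_loss` and its arguments are hypothetical, chosen for this example only; this is a minimal sketch under those assumptions, not the MFTCoder implementation.

```python
from collections import defaultdict


def task_balanced_loss(token_losses, task_ids, loss_mask):
    """Average per-task mean token losses so each task counts equally.

    token_losses: per-token negative log-likelihoods (floats)
    task_ids:     task label of the sample each token belongs to
    loss_mask:    1 for label tokens, 0 for prompt or padding tokens

    All names here are illustrative assumptions, not MFTCoder's API.
    """
    loss_sum = defaultdict(float)
    token_count = defaultdict(int)
    for loss, task, mask in zip(token_losses, task_ids, loss_mask):
        if mask:  # only label tokens take part in the loss
            loss_sum[task] += loss
            token_count[task] += 1
    # Mean loss inside each task, then an unweighted mean over tasks.
    per_task = [loss_sum[t] / token_count[t] for t in token_count]
    return sum(per_task) / len(per_task) if per_task else 0.0


# Task "a" has four label tokens, task "b" has one label token and one
# masked prompt token. A plain token average would let "a" dominate.
losses = [1.0, 1.0, 1.0, 1.0, 3.0, 9.0]
tasks = ["a", "a", "a", "a", "b", "b"]
mask = [1, 1, 1, 1, 1, 0]
balanced = task_balanced_loss(losses, tasks, mask)
```

With these toy numbers, a plain average over the five label tokens is dominated by the four tokens of task "a", while the balanced version gives the two tasks equal weight.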

Paper Structure (35 sections, 4 equations, 7 figures, 12 tables)

This paper contains 35 sections, 4 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: Overview of MFTCoder framework.
  • Figure 2: Data Generation Approach for Code Exercises Datasets using Single-turn Conversation Scheme.
  • Figure 3: Illustration of the differences in sample organization within a batch between normal SFT, dynamic padding, and Pack SFT tokenization modes. The light-colored squares represent the Prompt section of the samples, while the dark-colored squares represent the Label section (which participates in loss calculation). The blank squares represent the padding section.
  • Figure 4:
  • Figure 5: Radar chart of the CodeFuse-CodeLlama-34B model on the HumanEval, HumanEval-X, MBPP, DS-1000, and CodeFuseEval benchmarks, compared to GPT-3.5 and GPT-4.
  • ...and 2 more figures