CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

Weijie Lv, Xuan Xia, Sheng-Jun Huang

TL;DR

CodeACT tackles the dual challenges of data quality and training efficiency in code LLM fine-tuning by combining Complexity and Diversity Aware Sampling (CDAS) with a Dynamic Pack padding strategy. CDAS selects a diverse, complex subset of the training data by ranking samples within each cluster by their Instruction-Following Difficulty (IFD) score, while Dynamic Pack minimizes padding by concatenating length-sorted samples, yielding substantial speedups and memory savings. Across the OSS-Instruct and EVOL-Instruct datasets, CodeACT-enhanced models achieve competitive or superior performance with far less data and markedly reduced compute requirements, including notable gains on HumanEval. The work makes open-source Code LLM training more practical by reducing resource needs while retaining strong generalization through principled data selection and packing.

Abstract

Large language models (LLMs) have shown great potential in code-related tasks, yet open-source models lag behind their closed-source counterparts. To bridge this performance gap, existing methods generate vast amounts of synthetic data for fine-tuning, leading to inefficiencies in training. Motivated by the need for more effective and efficient training, we propose the Code Adaptive Compute-efficient Tuning (CodeACT) framework. CodeACT introduces the Complexity and Diversity Aware Sampling (CDAS) method to select high-quality training data based on complexity and diversity, and the Dynamic Pack padding strategy to reduce computational resource usage by minimizing padding tokens during training. Experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% performance increase on HumanEval, reduces training time by 78%, and decreases peak GPU memory usage by 27%. These findings underscore CodeACT's ability to enhance the performance and efficiency of open-source models. By optimizing both the data selection and training processes, CodeACT offers a comprehensive approach to improving the capabilities of open-source LLMs while significantly reducing computational requirements, addressing the dual challenges of data quality and training efficiency, and paving the way for more resource-efficient and performant models.
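
To make the padding argument concrete, the following Python sketch counts padding tokens under three strategies: padding every sample to the model's maximum length, padding each batch to its longest sample, and a greedy packing of length-sorted samples. It is a simplified illustration, not the paper's Dynamic Pack implementation. The function names, the greedy packing rule, and the toy lengths are all assumptions, and every sample is assumed to fit within the maximum length already.

```python
from typing import List


def pad_to_max(lengths: List[int], max_len: int) -> int:
    """Padding tokens when every sample is padded to the model's max length."""
    return sum(max_len - n for n in lengths)


def dynamic_padding(lengths: List[int], batch_size: int) -> int:
    """Padding tokens when each batch is padded to its own longest sample."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


def pack_sorted(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedily concatenate length-sorted samples into rows of <= max_len tokens.

    Assumption: a simple greedy rule stands in for the paper's packing logic,
    and no single sample is longer than max_len.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    used = 0
    for n in sorted(lengths, reverse=True):
        if current and used + n > max_len:
            rows.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        rows.append(current)
    return rows


def packed_padding(lengths: List[int], max_len: int, batch_size: int) -> int:
    """Padding tokens after packing, with each batch padded to its longest row."""
    row_lengths = [sum(row) for row in pack_sorted(lengths, max_len)]
    return dynamic_padding(row_lengths, batch_size)


# Toy token counts (hypothetical), not taken from any dataset in the paper.
lengths = [100, 900, 300, 250, 700, 50, 400, 150]
print(pad_to_max(lengths, max_len=1024))
print(dynamic_padding(lengths, batch_size=2))
print(packed_padding(lengths, max_len=1024, batch_size=2))
```

On this toy input the three strategies need 5342, 1750, and 850 padding tokens respectively. Packing also halves the number of rows the model processes, from 8 to 4, which is where a training-time saving would come from. A real implementation would additionally need attention masks or position resets so that packed samples do not attend to each other, which this sketch does not model.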

Paper Structure (27 sections, 5 equations, 4 figures, 3 tables, 1 algorithm)

This paper contains 27 sections, 5 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An overview of our proposed CDAS method, including three steps from top to bottom. Step 1: Clustering the EVOL-Instruct dataset to form multiple clusters. Step 2: Computing the Instruction-Following Difficulty score by comparing the model's perplexity with and without instructions. Step 3: Sampling the top m% instances from each re-ranked cluster to form a high-complexity sub-dataset that preserves data diversity. Finally, we use the selected data for fine-tuning to obtain CodeACT-Coder.
  • Figure 2: Illustration of different padding strategies, where the blank squares represent padding tokens. Top: Traditional padding strategy aligns samples to the model's maximum input length, resulting in high computational resource consumption. Middle: Dynamic padding strategy reduces the number of padding tokens by aligning samples to the length of the longest sample in each batch. Bottom: Our proposed Dynamic Pack strategy sorts samples by length and concatenates multiple samples within a batch, further optimizing the utilization of the model's maximum input length and reducing padding tokens.
  • Figure 3: Comparison of sampling rates and their impact on the performance of the DeepSeek-Coder-Base-6.7B model using the OSS-Instruct dataset.
  • Figure 4: Comparison of sampling methods and their impact on the performance of the DeepSeek-Coder-Base-6.7B model using the OSS-Instruct dataset.
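
The three CDAS steps described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. Cluster labels and perplexities are taken as precomputed inputs. The IFD score is assumed to be the ratio of the answer's perplexity with the instruction to its perplexity without it, with higher values treated as more complex. The `Sample` type, the `cdas_select` function, and the rule of keeping at least one sample per cluster are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    sample_id: str
    cluster: int              # Step 1 output: cluster label (assumed precomputed)
    ppl_with_instr: float     # perplexity of the answer given the instruction
    ppl_without_instr: float  # perplexity of the answer on its own


def ifd_score(sample: Sample) -> float:
    """Step 2 (assumed form): ratio of conditioned to unconditioned perplexity."""
    return sample.ppl_with_instr / sample.ppl_without_instr


def cdas_select(samples: List[Sample], top_percent: float) -> List[Sample]:
    """Step 3: keep the top `top_percent`% highest-IFD samples in every cluster."""
    by_cluster: Dict[int, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_cluster[sample.cluster].append(sample)

    selected: List[Sample] = []
    for cluster_id in sorted(by_cluster):
        # Re-rank within the cluster so each cluster contributes its hardest items.
        ranked = sorted(by_cluster[cluster_id], key=ifd_score, reverse=True)
        # Assumption: round up and keep at least one sample per cluster.
        keep = max(1, math.ceil(len(ranked) * top_percent / 100))
        selected.extend(ranked[:keep])
    return selected


# Hypothetical perplexities, chosen only to exercise the selection logic.
data = [
    Sample("a", 0, 2.0, 4.0),
    Sample("b", 0, 3.0, 4.0),
    Sample("c", 0, 1.0, 4.0),
    Sample("d", 0, 3.6, 4.0),
    Sample("e", 1, 1.0, 2.0),
    Sample("f", 1, 1.6, 2.0),
]
chosen = cdas_select(data, top_percent=50)
print([s.sample_id for s in chosen])
```

Ranking inside each cluster, rather than globally, is what keeps the subset diverse: a cluster of uniformly easy samples still contributes its share instead of being dropped entirely.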