Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Codefuse, Ling Team: Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou

TL;DR

Ling-Coder-Lite is a Mixture-of-Experts (MoE) code LLM designed to deliver high coding performance with strong efficiency. The approach hinges on extensive, high-quality data curation (source code, repository-level data, code-related data, and synthetic instructions) and a multi-stage training regime: continuous pre-training, annealing, supervised fine-tuning, and Direct Preference Optimization. On 12 diverse benchmarks, Ling-Coder-Lite achieves on-par or state-of-the-art performance among models of similar size, while offering substantial practical efficiency gains, notably a roughly 50% reduction in deployment resources. The work also emphasizes openness: the authors release the Ling-Coder-Lite models and a substantial portion of the data to accelerate research on efficient code LLMs with real-world applicability in AI-assisted development environments.

Abstract

Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.

Paper Structure

This paper contains 36 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Ling-Coder-Lite achieves an effective trade-off between high performance and efficiency by leveraging high-quality data. (a) A substantial portion of high-quality data used in the Ling-Coder-Lite training process (approximately 30 million samples) has been released as open-source data; (b) Average performance of various code LLMs with similar parameter size on 12 code benchmarks; (c) A comparison of various models over performance (in terms of average evaluation scores) versus the theoretical number of computational operations (in terms of TFLOPs per single inference with a context length of 4096).
  • Figure 2: Data Curation Pipeline for Code Data Utilized by Ling-Coder-Lite.
  • Figure 3: Pipeline for Recalling Code-Related Data from Common Crawl.
  • Figure 4: Pipeline of Bottom-Up Instruction Data Synthesis.
  • Figure 5: Training Pipeline for Ling-Coder-Lite.
  • ...and 1 more figure