Mastering the Craft of Data Synthesis for CodeLLMs

Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li

TL;DR

The paper addresses how to build robust CodeLLMs through focused data engineering, emphasizing data synthesis and filtering as critical levers. It offers a taxonomy across model-building phases, core objectives, and specific tasks, and surveys over 50 works from the last two years to ground practical guidance. By contrasting code-specific practices with general data-synthesis literature, it provides a structured roadmap for seed collection, synthetic generation, filtering, and evaluation, plus a public GitHub resource for datasets. The analysis highlights contributions in improving data quality and diversity, enabling better reasoning and iterative programming, and outlines future directions such as learning across low-resource languages and automating data synthesis with agent-based approaches.

Abstract

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.

Paper Structure

This paper contains 24 sections and 2 figures.

Figures (2)

  • Figure 1: Practical guidance for the code-related data generation pipeline. We also recommend several large language models (LLMs) that offer strong performance while maintaining a balanced cost (see the appendix).
  • Figure 2: Taxonomy of data synthesis and filtering techniques for code-related tasks.