Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li
TL;DR
The paper addresses how to build robust CodeLLMs through focused data engineering, emphasizing data synthesis and filtering as critical levers. It offers a taxonomy across model-building phases, core objectives, and specific tasks, and surveys over 50 works from the last two years to ground practical guidance. By contrasting code-specific practices with general data-synthesis literature, it provides a structured roadmap for seed collection, synthetic generation, filtering, and evaluation, plus a public GitHub resource for datasets. The analysis highlights contributions in improving data quality and diversity, enabling better reasoning and iterative programming, and outlines future directions such as learning across low-resource languages and automating data synthesis with agent-based approaches.
Abstract
Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
