Brevity is the soul of wit: Pruning long files for code generation
Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos
TL;DR
The paper investigates data pruning for fine-tuning code-generation LLMs and finds that a simple heuristic, pruning the longest files, can outperform embedding-based pruning in compute-limited settings, yielding up to a 2x training-efficiency gain at matched performance or a 3.5% absolute improvement on HumanEval at matched compute. It shows that long files often contribute many tokens but little useful signal, so pruning them shifts the data mix toward more informative content. However, the benefits diminish as compute increases, and pruning can raise perplexity on held-out long code, which underscores the need for diverse evaluation and careful consideration of downstream use cases. The work advocates for broader data-pruning intuitions, bootstrapped evaluation, and strategies for obtaining useful long-context code data, while acknowledging its limitations and its specificity to code generation. Overall, it offers practical guidance on length-based pruning and highlights open questions about long-context data quality and evaluation.
Abstract
Data curation is commonly considered a "secret sauce" for LLM training, with higher-quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become an increasingly important focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher-quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic, pruning long files, outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, raising the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files), provides a compelling heuristic-based method for data pruning, and brings to light questions in how we evaluate code generation models.
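To make the length heuristic concrete, the following is a minimal sketch of pruning the longest files from a corpus. It is not the authors' implementation: the function name `prune_longest_files`, the `prune_fraction` parameter, and the whitespace-based token count are assumptions chosen for illustration. In practice the token count would presumably come from the tokenizer of the model being finetuned.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated tokens."""
    return len(text.split())


def prune_longest_files(
    files: List[str],
    prune_fraction: float,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Drop the longest `prune_fraction` of files, measured in tokens.

    The surviving files are returned in their original order.
    """
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")

    lengths = [count_tokens(f) for f in files]
    n_keep = len(files) - int(len(files) * prune_fraction)

    # Rank file indices from shortest to longest (sorted() is stable, so ties
    # keep their original relative order), then keep the n_keep shortest.
    shortest_first = sorted(range(len(files)), key=lambda i: lengths[i])
    keep = sorted(shortest_first[:n_keep])
    return [files[i] for i in keep]


if __name__ == "__main__":
    corpus = ["a b", "a b c d e f", "a", "a b c d"]
    # Pruning 50% removes the two longest files (6 and 4 tokens).
    print(prune_longest_files(corpus, prune_fraction=0.5))  # ['a b', 'a']
```

Pruning by a fraction of files rather than a fixed token threshold is a design choice made here for simplicity; a threshold on token count would work equally well in this sketch.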
