Brevity is the soul of wit: Pruning long files for code generation

Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos

TL;DR

The paper investigates data pruning for fine-tuning code-generation LLMs and finds that a simple heuristic, pruning the longest files, can outperform embedding-based pruning in compute-limited settings, yielding up to a 2x training-efficiency gain at matched performance or a 3.5% absolute improvement on HumanEval at matched compute. It shows that long files often hold many tokens but little useful signal, so pruning them shifts the data mix toward more informative content. However, the benefits diminish with increased compute, and pruning can raise perplexity on held-out long files, underscoring the need for diverse evaluation and careful consideration of downstream use cases. The work advocates for broader data-pruning intuitions, bootstrapped evaluation, and strategies to obtain useful long-context code data, while acknowledging its limitations and its specificity to code generation. Overall, it provides practical guidance on length-based pruning and highlights open questions about long-context data quality and evaluation.

Abstract

Data curation is commonly considered a "secret sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a growing focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic, pruning long files, outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, raising the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files), provides a compelling heuristic-based method for data pruning, and brings to light questions about how we evaluate code generation models.
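
The length heuristic described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `prune_longest`, the `prune_fraction` parameter, and the rule of dropping whole documents until at least that share of tokens is gone are all assumptions made for this example. It expects per-document token counts that have already been computed with whatever tokenizer is in use.

```python
from typing import List, Sequence


def prune_longest(token_counts: Sequence[int], prune_fraction: float) -> List[int]:
    """Return the indices of documents kept after length-based pruning.

    Hypothetical helper: drops whole documents, longest first, until at
    least `prune_fraction` of all tokens has been removed.
    """
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")

    total_tokens = sum(token_counts)
    budget = prune_fraction * total_tokens

    # Visit documents from longest to shortest.
    order = sorted(range(len(token_counts)),
                   key=lambda i: token_counts[i],
                   reverse=True)

    removed_tokens = 0
    dropped = set()
    for i in order:
        if removed_tokens >= budget:
            break  # the token budget has been met
        dropped.add(i)
        removed_tokens += token_counts[i]

    # Preserve the original document order among the survivors.
    return [i for i in range(len(token_counts)) if i not in dropped]


if __name__ == "__main__":
    counts = [100, 20, 30, 800, 50]  # toy token counts, not real data
    print(prune_longest(counts, 0.5))
```

Because a few very long files can hold a large share of the tokens, a sketch like this typically removes far fewer documents than tokens. In the toy input, a single document covers the whole 50% budget.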

Paper Structure (12 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Cumulative distribution of document lengths in the Python subset of the Stack dataset (Kocetkov et al., 2022). Documents are ordered on the x-axis in increasing length order; y-axis shows fraction of whole dataset accounted for. Green lines highlight that the 2% of longest files account for the last 20% of tokens. The right panel shows an example long file (the second longest file). Long files tend to have very little useful signal, being full of large data arrays or "spaghetti code".
  • Figure 2: Visualizations of StarEncoder embedding space. a) Cosine similarity between every pair of cluster centroids. Oddly, the first 37 cluster centroids are all very close to each other. b) We visualize each document as an $(x,y)$ pair where the $x$-axis is the distance to the nearest cluster centroid (computed as one minus the cosine similarity), and the $y$-axis is the length of the file in tokens (using the Llama tokenizer). We immediately see the reason for the 37 super close clusters: StarEncoder appears to tightly pack most documents below a certain length (bottom-left cloud of points), while longer documents tend to be more spread out in embedding space (top-right cloud of points). We also highlight the ten longest files in the dataset in red.
  • Figure 3: Results across baselines and length pruning. All plots show the same 5 conditions: no pruning ("none") and the best performing (embedding-based) method from Yang et al. (2023) ("SCIP") as baselines, and then pruning $P$% of tokens from the longest documents (our method) for $P=10, 20, 50$. Leftmost column shows performance after roughly 1 epoch for each condition, which can be used to look at training efficiency. We note that aggressive length pruning ($P=50$%) matches the performance of no pruning, indicating a 2x efficiency improvement. The latter two columns show compute-controlled experiments, where all conditions are trained on the same number of tokens (but possibly more epochs, e.g., the 50% length pruning run will have seen twice as many epochs as no pruning). At 8000 steps (middle column), aggressive length pruning still seems to outperform no pruning and SCIP baselines, but this advantage diminishes at the larger compute scale (16000 steps, rightmost column). Error bars indicate standard error.
  • Figure 4: Raw curves of evaluation benchmark performance over training. Shaded area indicates standard error. Note the results in Figure 3 are taken from this plot, to more clearly show snapshots at various points during training.
  • Figure 5: Two views on validation perplexity evolution on various length bins. a) We show the curves throughout training with each panel showing a different length bin. It's clear that the most aggressively pruned setting (red curve, Prune longest 50%) overfits with multiple epochs (around 10k steps) on all lengths, and generally has worse perplexity on longer documents; intuitively this makes sense since the longest file in the 50% pruned subset was only $\approx 3600$ tokens. Mirroring our results from Figure 3, this most aggressive pruning setting performs best in the compute-limited regime (up to 8k steps) on shorter documents (more akin to the downstream benchmarks HumanEval and MBPP). b) We instead focus on the perplexities across length bins, with each color showing a different step. Considering any time step of the "no pruning" baseline, we see that perplexities decrease with longer files, up to the context length (black dotted line), as we'd expect since there's more "useful" context enabling the model to lower perplexity. Beyond the context length, perplexities increase as expected (since all documents are being cut somewhere in the middle). Surprisingly, perplexities again decrease on the longest documents, seeming to indicate that being split isn't affecting these documents. This is in line with our qualitative observations of repetitiveness in the motivation section, and further motivates pruning these files.
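
The x-axis quantity in Figure 2b, the distance from a document embedding to its nearest cluster centroid, can be illustrated with a small sketch. It assumes the distance is one minus the cosine similarity and uses plain Python lists as stand-ins for StarEncoder embeddings and centroids. The function names `cosine_similarity` and `nearest_centroid` are hypothetical and do not come from the paper's code.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def nearest_centroid(embedding: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the closest centroid.

    Distance is taken to be 1 - cosine similarity (an assumption of this sketch).
    """
    distances = [1.0 - cosine_similarity(embedding, c) for c in centroids]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


if __name__ == "__main__":
    toy_centroids = [[1.0, 0.0], [0.0, 1.0]]  # toy values, not real centroids
    print(nearest_centroid([0.0, 3.0], toy_centroids))
```

Pairing each document's distance with its token count would give the kind of $(x,y)$ scatter described in the caption, which is how a length confound in the embedding space becomes visible.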