Embarrassingly Simple Self-Distillation Improves Code Generation

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang

Abstract

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
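
The recipe above is simple enough to fit in a short script. Below is a minimal, illustrative sketch using a Hugging Face-style API; the model name, decoding settings (temperature and top-p truncation), and training hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of simple self-distillation (SSD): sample raw solutions from the
# model itself with a training-time temperature and truncation, then run plain
# supervised fine-tuning on those samples. All specific values below are
# illustrative assumptions, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Instruct-2507"  # assumption: any instruct code model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def sample_solutions(prompts, t_train=0.8, top_p=0.95, n_per_prompt=4, max_new_tokens=1024):
    """Step 1: draw raw solutions from the base model at training-time temperature."""
    samples = []
    for prompt in prompts:
        chat = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False, add_generation_prompt=True,
        )
        inputs = tok(chat, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=t_train,          # training-time decoding temperature T_train
            top_p=top_p,                  # truncation of the sampling distribution
            num_return_sequences=n_per_prompt,
            max_new_tokens=max_new_tokens,
        )
        prompt_len = inputs["input_ids"].shape[1]
        for seq in out:
            samples.append(chat + tok.decode(seq[prompt_len:], skip_special_tokens=True))
    return samples

def sft_on_own_samples(samples, lr=1e-5, epochs=1):
    """Step 2: standard supervised fine-tuning on the model's own raw outputs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in samples:
            batch = tok(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
            # Next-token cross-entropy; for brevity, prompt tokens are not masked
            # and examples are not batched here.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# No verifier, teacher, RL, or code execution environment is involved: the
# fine-tuning targets are simply the model's own sampled solutions.
```

The sample-then-fine-tune structure mirrors the description in the abstract; a full run would additionally mask prompt tokens in the loss, batch examples, and possibly iterate the loop.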

Paper Structure

This paper contains 81 sections, 6 theorems, 50 equations, 15 figures, and 5 tables.

Key Result

Lemma B.1

For any distribution $p$ over $\mathcal{V}$ and temperatures $T_1, T_2 > 0$, applying temperature scaling at $T_2$ to the $T_1$-tempered distribution is equivalent to a single temperature scaling of $p$ at $T_1 T_2$. The same holds when $p$ is replaced by any restriction to a fixed support set $S \subseteq \mathcal{V}$.
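
Under the usual convention that temperature $T$ raises probabilities to the power $1/T$ and renormalizes (the operator $\tau_T$ below is notation introduced only for this sketch), the multiplicative composition follows in one line:

$$\tau_{T_2}\!\bigl(\tau_{T_1}(p)\bigr)_v \;\propto\; \bigl(p_v^{1/T_1}\bigr)^{1/T_2} \;=\; p_v^{1/(T_1 T_2)} \;\propto\; \tau_{T_1 T_2}(p)_v,$$

and since both sides are normalized distributions over the same support, they coincide. This is the composition behind the effective temperature $T_{\textsf{eff}} = T_{\textsf{train}}T_{\textsf{eval}}$ used in Figure 3.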

Figures (15)

  • Figure 1: Simple self-distillation (SSD) is embarrassingly simple, yet yields substantial LiveCodeBench v6 gains across five models spanning two families and three scales, with both instruct and thinking variants. Left: SSD samples from the base model with training-time decoding temperature $T_{\textsf{train}}$, fine-tunes on its own raw outputs, and then decodes at evaluation time with $T_{\textsf{eval}}$; it uses no RL, verifier, teacher, or code execution environment. Right: LiveCodeBench v6 pass@1 for Qwen3-4B-Instruct and Qwen3-30B-Instruct on the Overall, Medium, and Hard splits (orange = 4B, blue = 30B; hatched = base, solid = +SSD). The footer highlights the broader pattern: all five evaluated models improve, Qwen3-30B-Instruct gains +30% relative pass@1, and the largest gains occur on harder problems.
  • Figure 2: SSD outperforms the best point in the evaluated base-model decoding sweep within standard global decoding policies. Each panel shows one model (30B-Instruct, 4B-Instruct, 4B-Thinking) and one metric (pass@1 or pass@5); amber curves sweep the base-model evaluation temperature while blue horizontal lines mark SSD results from the main results table. Solid shading marks the margin over all problems; outlined (dashed-border) shading marks the margin on hard problems.
  • Figure 3: Training and evaluation temperatures compose through a broad effective-temperature band, while truncation raises the achievable pass@1 within that band. (a) Representative Qwen3-30B-Instruct sweeps on LCB v6 against $T_{\textsf{eff}} = T_{\textsf{train}}T_{\textsf{eval}}$: gray = no truncation, amber/green = truncated training-time sampling. Dots are runs, curves are quadratic fits, and the dashed line marks the 42.4% baseline. (b) Qwen3-4B-Thinking on LCB v6 with truncation, shown as the best pass@1 across iterations over $(T_{\textsf{train}}, T_{\textsf{eval}})$.
  • Figure 4: A single evaluation temperature cannot satisfy both exploration at forks and precision at locks. Left: a sorting example in which the algorithm-choice token is a fork position (rust-orange), while the later uses of mid are lock positions (blue); gray ghost branches indicate other valid algorithms that could have been taken at the fork. Right: token distributions for the same two context types under low and high $T_{\textsf{eval}}$, with head and tail mass shown explicitly. Low $T_{\textsf{eval}}$ keeps the lock precise but collapses the fork's viable head (low exploration); high $T_{\textsf{eval}}$ restores exploration at the fork but revives the lock's distractor tail (low precision).
  • Figure 5: SSD turns forks into plateaus and locks into spikes. Tokens are ranked by probability. Hatched bars and dashed curves show the base model; solid bars and solid curves show the model after SSD; the red dashed cutoff marks the support retained during SSD. (a) Fork-like state: the diffuse tail is trimmed, but several top continuations remain and become more evenly weighted, forming a broad plateau over viable branches. (b) Lock-like state: the same rule trims the tail much more aggressively and concentrates mass on the dominant token, producing a sharper spike. A toy numerical sketch of this fork/lock contrast follows the figure list.
  • ...and 10 more figures
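
To make the fork/lock contrast from Figures 4 and 5 concrete, here is a toy numpy sketch with invented numbers; it uses plain temperature scaling plus nucleus (top-p) truncation as a stand-in for the paper's exact truncation rule, and is meant only to show that one global transform trims a peaked "lock-like" distribution and a flat-headed "fork-like" distribution very differently.

```python
# Toy illustration (invented numbers) of the precision-exploration contrast:
# the same temperature + top-p truncation acts very differently on a fork-like
# (flat head) distribution and a lock-like (single dominant token) distribution.
import numpy as np

def temper_and_truncate(p, temperature=0.7, top_p=0.9):
    """Apply temperature scaling, then nucleus (top-p) truncation, then renormalize."""
    q = p ** (1.0 / temperature)
    q = q / q.sum()
    order = np.argsort(q)[::-1]                 # tokens sorted by descending probability
    csum = np.cumsum(q[order])
    cutoff = np.searchsorted(csum, top_p) + 1   # smallest prefix whose mass reaches top_p
    mask = np.zeros_like(q, dtype=bool)
    mask[order[:cutoff]] = True
    q = np.where(mask, q, 0.0)
    return q / q.sum()

# Fork-like state: several viable continuations share the head, plus a diffuse tail.
fork = np.array([0.22, 0.20, 0.18, 0.15] + [0.25 / 21] * 21)
# Lock-like state: one clearly dominant token plus a distractor tail.
lock = np.array([0.80] + [0.20 / 24] * 24)

for name, p in [("fork", fork), ("lock", lock)]:
    q = temper_and_truncate(p)
    print(f"{name}: kept {np.count_nonzero(q)} tokens, top prob {q.max():.2f}")

# Expected behavior: the lock collapses to (nearly) a single spike, while the fork
# keeps its several top continuations with more even weights -- a plateau.
```

With these invented numbers the lock retains a single token with probability 1.00, while the fork retains its four head tokens at roughly 0.31/0.27/0.23/0.18, illustrating the spike-versus-plateau picture in Figure 5.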

Theorems & Definitions (9)

  • Lemma B.1: Temperatures compose multiplicatively
  • Proof
  • Proposition B.2: Evaluation-time form under local ideal fit
  • Proposition B.3: Local gain decomposition under local ideal fit
  • Remark B.4: Two limiting cases
  • Proposition B.5: Normal form of decode-only policies
  • Proof
  • Corollary B.6: Prefix rigidity
  • Corollary B.7: Power rigidity