Does Few-Shot Learning Help LLM Performance in Code Synthesis?
Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, Wei Wang
TL;DR
The paper investigates how few-shot demonstrations in code-generation prompts affect LLM performance and introduces two prompt-ranking strategies, CODEEXEMPLAR-FREE (model-free) and CODEEXEMPLAR-BASED (model-based), that select the most informative examples from large pools under token-cost constraints. Both methods assume gray-box access to model logits and rely on a fine-grained surrogate, $PP_{target}$, that is linked to the downstream metric Pass@1; the model-based ranker additionally uses embedding signals. Experiments with CodeLlama on the HumanEval+ benchmark show improvements of about $5\%$ in Pass@1 and substantial perplexity reductions, supporting the value of informative few-shot example selection for code synthesis. The study also identifies a distribution-shift bottleneck, which points to the need for better prompt and dataset design to improve generalization even when model weights are unchanged. Overall, the findings indicate that careful prompt design can yield practical code-generation gains, and they motivate future work on richer, higher-quality coding datasets and broader model evaluations.
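To make the surrogate idea concrete, here is a minimal sketch of scoring candidate few-shot examples by perplexity. It assumes that $PP_{target}$ denotes the perplexity of the target solution tokens, which this summary does not spell out, and that per-token natural-log probabilities have already been obtained from the model's logits. The function names `target_perplexity` and `rank_exemplars` are hypothetical and are not taken from the paper; the paper's actual ranking procedures may differ.

```python
import math
from typing import Dict, List, Sequence


def target_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a target sequence from its per-token natural-log probabilities.

    Assumption: the log-probabilities were computed with a given few-shot
    example already placed in the prompt.
    """
    if not token_logprobs:
        raise ValueError("need at least one target token")
    mean_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_negative_logprob)


def rank_exemplars(logprobs_by_exemplar: Dict[str, Sequence[float]]) -> List[str]:
    """Order candidate exemplars from lowest to highest target perplexity.

    Hypothetical helper: lower perplexity is treated as "more informative".
    """
    return sorted(
        logprobs_by_exemplar,
        key=lambda name: target_perplexity(logprobs_by_exemplar[name]),
    )
```

In this sketch, an exemplar that makes the reference solution more likely under the model receives a lower score and is ranked earlier; a token budget could then be enforced by taking examples from the front of the ranking until the budget is exhausted.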
Abstract
Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study of whether few-shot examples improve LLMs' coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers two approaches for selecting few-shot examples: a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The two methods offer a trade-off between improved performance and reliance on training data and interpretability. Both methods significantly improve CodeLlama's coding ability on the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.
