Does Few-Shot Learning Help LLM Performance in Code Synthesis?

Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, Wei Wang

TL;DR

The paper investigates how few-shot demonstrations in code-generation prompts affect LLM performance and introduces two prompt-ranking strategies, CODEEXEMPLAR-FREE (model-free) and CODEEXEMPLAR-BASE (model-based), to select the most informative examples from large pools under token-cost constraints. Both methods operate with gray-box access to model logits and rely on a fine-grained surrogate, $PP_{target}$, linked to the downstream metric $Pass@1$, with embedding signals enabling the model-based ranker. Empirical results on CodeLlama across the HumanEval+ benchmark show meaningful improvements of about $5\%$ in $Pass@1$ and substantial perplexity reductions, validating the effectiveness of informative few-shot example selection for code synthesis. The study also reveals a distribution-shift bottleneck, emphasizing the need for better prompt and dataset design to maximize generalization, even without changes to model weights. Overall, the findings demonstrate that thoughtful prompt design can yield practical code-generation gains, guiding future work toward richer, higher-quality coding datasets and broader model evaluations.

Abstract

Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study of whether few-shot examples improve LLMs' coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers 2 approaches for selecting few-shot examples: a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods offer a trade-off between improved performance and reliance on training data and interpretability. Both methods significantly improve CodeLlama's coding ability on the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of the MetaNet Ranker and Perplexity Ranker. We use a ranking algorithm, $f_p$, to select the top $N=2$ examples to form a prompt, which is fed to the LLM to test its coding capabilities.
  • Figure 2: Studying the performance gain (i.e., $PP_{target}(y_m^{(j)}; f_m, \hat{D}_j)-PP_{target}(y_m^{(j)}; f_m, \emptyset)$) from adding multiple examples, $\tilde{D}_j$, to each prompt. Each example is chosen randomly from a larger pool of candidate examples, $\hat{D}_j$. Deviations are computed across different prompts.
  • Figure 3: t-SNE Plots of the hidden representation of the prompt with given examples, $h_b^{(j,i)}$. Colors denote the change in Perplexity (Target). As shown, the embeddings naturally encode the Perplexity (Target) score.
  • Figure 4: Correlation between Perplexity (Source) and Perplexity (Target) on some randomly chosen examples across some randomly chosen problems.
  • Figure 5: Main Results using Perplexity (Target). We compare the perplexity improvement gained from adding $N$ examples to the prompt, where each example is chosen by a different $f_p$ ranker function.
  • ...and 3 more figures
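
The captions for Figures 1 and 2 describe a ranker $f_p$ that keeps the top $N=2$ examples, and a gain measured as the change in Perplexity (Target) relative to an empty prompt. The following is a minimal sketch of that idea, not the authors' CODEEXEMPLAR implementation. It assumes Perplexity (Target) is the standard perplexity (the exponential of the negative mean token log-probability) over the target solution tokens, and that those log-probabilities have already been obtained from the model. The function names `perplexity` and `rank_examples`, the example identifiers, and all numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_examples(
    baseline_logprobs: Sequence[float],
    logprobs_with_example: Dict[str, Sequence[float]],
    top_n: int = 2,
) -> List[str]:
    """Return the top_n example ids with the largest perplexity reduction.

    baseline_logprobs: target-token log-probs with no few-shot example.
    logprobs_with_example: target-token log-probs when each candidate
        example is placed in the prompt (hypothetical inputs).
    """
    base_pp = perplexity(baseline_logprobs)
    # Change in perplexity from adding the example; negative means the
    # example made the target solution more likely under the model.
    delta = {
        name: perplexity(lp) - base_pp
        for name, lp in logprobs_with_example.items()
    }
    # Sort ascending so the most negative change (best example) comes first.
    return sorted(delta, key=delta.get)[:top_n]


# Made-up log-probabilities, for illustration only.
baseline = [-1.0, -1.0]
candidates = {
    "ex_a": [-0.5, -0.5],
    "ex_b": [-1.2, -1.2],
    "ex_c": [-0.8, -0.8],
}
selected = rank_examples(baseline, candidates, top_n=2)
print(selected)
```

In this toy input, `ex_a` and `ex_c` lower the perplexity of the target tokens while `ex_b` raises it, so only the first two would be placed in the prompt. Scoring against the target solution requires knowing that solution, so a sketch like this only illustrates the selection criterion, not a deployable ranker.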