The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation
Yingjie Fu, Bozhou Li, Linyi Li, Wentao Zhang, Tao Xie
TL;DR
This paper presents the first comprehensive study of example-based code generation with LLMs, using an iterative framework in which requirements are given as I/O examples. It formalizes two sub-objectives: conforming to the given I/O examples (O1) and implementing the target functionality (O2). It introduces the InterCode benchmark with 172 tasks in C#, and an iterative evaluation workflow that adapts I/O examples across turns to resolve ambiguities. The results show that describing requirements with I/O examples instead of natural language lowers scores by over 60%, and that over 95% of the successfully implemented functionalities are achieved in the first iteration, though augmenting I/O examples with even imperfect natural language improves scores. Overall, the work highlights current limitations of LLMs in multi-turn, I/O-driven code generation and offers practical guidance on prompt design and evaluation to advance this capability.
Abstract
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. Alternatively, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opacity, and incompleteness make it harder to understand and implement the target requirements. Generating code from I/O examples (i.e., example-based code generation) therefore provides a new perspective, allowing us to additionally evaluate LLMs' capability to infer target functionalities from limited information and to process requirements given in a new form. Yet the capability of LLMs in example-based code generation remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. We adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code that conforms to the given examples, and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities. The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs' scores decrease by over 60%, and the vast majority (over 95%) of successfully implemented functionalities are achieved in the first round of the iterations. Furthermore, we find that combining I/O examples with even imprecise and fragmentary natural-language descriptions greatly improves LLM performance, and that the selection of initial I/O examples also influences the score, suggesting opportunities for prompt optimization.
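To make the two sub-objectives concrete, the following is a minimal Python sketch of an iterative, example-based evaluation loop. It is an illustration under stated assumptions, not the paper's actual harness: the names `conforms`, `find_counterexample`, `evaluate`, and `toy_generate` are hypothetical, the "model" is a toy stand-in that picks the first of three fixed candidate programs consistent with the examples, functions are restricted to integer inputs and outputs, and the policy of stopping as soon as the first sub-objective fails is an assumption.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Example = Tuple[int, int]        # (input, expected output)
Program = Callable[[int], int]   # hypothetical: integer-to-integer functions only


def conforms(program: Program, examples: Sequence[Example]) -> bool:
    """Sub-objective 1: the program reproduces every given I/O example."""
    return all(program(x) == y for x, y in examples)


def find_counterexample(program: Program, target: Program,
                        probes: Sequence[int]) -> Optional[Example]:
    """Return one I/O example on which the program and the target disagree."""
    for x in probes:
        if program(x) != target(x):
            return (x, target(x))
    return None


def evaluate(generate: Callable[[List[Example]], Program], target: Program,
             initial: Sequence[Example], probes: Sequence[int],
             max_rounds: int = 5) -> Dict[str, object]:
    """Iteratively add examples until the target is implemented or rounds run out."""
    examples = list(initial)
    for round_no in range(1, max_rounds + 1):
        program = generate(examples)
        if not conforms(program, examples):
            # Assumption: stop as soon as the given examples are violated.
            return {"o1": False, "o2": False, "rounds": round_no}
        counterexample = find_counterexample(program, target, probes)
        if counterexample is None:
            # Sub-objective 2: behaves like the target on all probe inputs.
            return {"o1": True, "o2": True, "rounds": round_no}
        examples.append(counterexample)  # new example resolves the ambiguity
    return {"o1": True, "o2": False, "rounds": max_rounds}


def toy_generate(examples: List[Example]) -> Program:
    """Toy stand-in for an LLM: first fixed candidate consistent with the examples."""
    candidates: List[Program] = [lambda x: x + 2, lambda x: x * 2, lambda x: x * x]
    for candidate in candidates:
        if conforms(candidate, examples):
            return candidate
    return candidates[0]


result = evaluate(toy_generate, lambda x: x * x, [(2, 4)], probes=[0, 1, 2, 3])
print(result)  # {'o1': True, 'o2': True, 'rounds': 3}
```

The single initial example `(2, 4)` is consistent with `x + 2`, `x * 2`, and `x * x`, so the toy generator needs two extra examples before it settles on the target. This mirrors the ambiguity of I/O examples that the abstract describes, and shows why the choice of initial examples can change the outcome.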
