CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, An Fu

TL;DR

CodeIF-Bench addresses the gap in evaluating instruction-following by LLMs during interactive code generation. It introduces verifiable instruction strategies, a dual conversation regime (Static and Dynamic), and a test-driven evaluation framework that measures IA, CA, IFR, and CIF across 6 LLMs. The study reveals that increasing repository context and dialogue history degrades instruction-following, while prompting strategies that manage context (notably CI and CoT) can improve robustness and efficiency. These findings point to context management as a critical direction for advancing interactive coding with LLMs and have practical implications for real-world development workflows.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs' abilities to generate code that strictly follows users' instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with real-world software development requirements. Each instruction can be independently and objectively validated through specified test cases, which makes it possible to evaluate instruction-following capability in multi-turn interactions. In both Static Conversation and Dynamic Conversation settings, we evaluate the performance of 6 state-of-the-art LLMs and identify two important factors that influence their instruction-following ability in multi-turn interactions: additional repository context and a gradually growing interaction history. Furthermore, we identify a potential direction for improvement: context management. The code and data are available at https://github.com/zhu-zhu-ding/CodeIF-Bench.
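To make the idea of test-verifiable instructions in a multi-turn setting concrete, the following is a minimal sketch, not the benchmark's actual harness. The `Turn` class, the `run_check` and `evaluate_conversation` helpers, and the result keys `current` and `cumulative` are all hypothetical names introduced here for illustration; they are not the paper's metrics or API. The sketch assumes each instruction comes with a checker that can be run against the code generated in a given round, and that a round can be judged both on its newest instruction and on every instruction issued so far.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Checker = Callable[[Dict[str, object]], bool]


@dataclass
class Turn:
    """One conversation round: an instruction plus a test that verifies it."""
    instruction: str
    check: Checker  # hypothetical: inspects the namespace of the generated code


def run_check(code: str, check: Checker) -> bool:
    """Execute generated code in a fresh namespace and apply one checker."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # illustration only; a real harness should sandbox this
        return bool(check(namespace))
    except Exception:
        return False


def evaluate_conversation(turns: List[Turn], responses: List[str]) -> List[dict]:
    """Judge each round on its own instruction and on all instructions so far."""
    results = []
    for i, code in enumerate(responses):
        current = run_check(code, turns[i].check)
        cumulative = all(run_check(code, t.check) for t in turns[: i + 1])
        results.append({"turn": i + 1, "current": current, "cumulative": cumulative})
    return results


def raises_type_error(ns: Dict[str, object]) -> bool:
    """Checker for the second instruction: non-numeric input must be rejected."""
    try:
        ns["add"]("a", 1)
    except TypeError:
        return True
    return False


turns = [
    Turn("Write add(a, b) that returns the sum.", lambda ns: ns["add"](2, 3) == 5),
    Turn("Raise TypeError for non-numeric input.", raises_type_error),
]

# Hypothetical model outputs, one per round.
good_responses = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n"
    "    if not all(isinstance(x, (int, float)) for x in (a, b)):\n"
    "        raise TypeError('numeric input required')\n"
    "    return a + b\n",
]
# Second round satisfies the new instruction but breaks the first one.
forgetful_responses = [
    good_responses[0],
    "def add(a, b):\n    raise TypeError('numeric input required')\n",
]

good_results = evaluate_conversation(turns, good_responses)
forgetful_results = evaluate_conversation(turns, forgetful_responses)
```

In this toy run, the second "forgetful" response passes the check for its newest instruction but fails the cumulative check, which is the kind of regression a multi-turn evaluation is meant to expose and a single-turn correctness test would miss.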

Paper Structure

This paper contains 21 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: An example of interactive code generation, where the developer provides follow-up instructions to clarify the requirement and address issues in the generated code
  • Figure 2: CodeIF-Bench construction pipeline. The top part illustrates the verifiable instruction strategy extraction process, and the bottom part presents the data collection procedure
  • Figure 3: IFR results in Static Conversation
  • Figure 4: IFR results in Dynamic Conversation
  • Figure 5: The IA results on various VI strategies
  • ...and 1 more figure