CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Peiding Wang; Li Zhang; Fang Liu; Lin Shi; Minxiao Li; Bo Shen; An Fu

TL;DR

CodeIF-Bench addresses the gap in evaluating instruction-following by LLMs during interactive code generation. It introduces verifiable instruction strategies, a dual conversation regime (Static and Dynamic), and a test-driven evaluation framework that measures IA, CA, IFR, and CIF across 6 LLMs. The study reveals that increasing repository context and dialogue history degrades instruction-following, while prompting strategies that manage context (notably CI and CoT) can improve robustness and efficiency. These findings point to context management as a critical direction for advancing interactive coding with LLMs and have practical implications for real-world development workflows.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs' ability to generate code that strictly follows users' instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with real-world software development requirements. Each instruction can be independently and objectively validated through specified test cases, which facilitates the evaluation of instruction-following capability in multi-turn interactions. In both Static Conversation and Dynamic Conversation settings, we evaluate the performance of 6 state-of-the-art LLMs and identify two important factors that influence their instruction-following ability in multi-turn interactions: additional repository context and gradually increasing interaction history. Furthermore, we identify a potential direction for improvement: context management. The code and data are available at https://github.com/zhu-zhu-ding/CodeIF-Bench.
