CIFE: Code Instruction-Following Evaluation

Sravani Gunnu, Shanmukha Guttula, Hima Patel

TL;DR

CIFE introduces a constraint-centered benchmark for Python code generation, addressing the gap between functional correctness and developer-specified requirements. It pairs 1,000 tasks with multi-category constraints and evaluates both adherence and correctness using CSR, SSR, and the C2A score via an LLM-as-Judge, complemented by human validation. The work highlights that soft adherence is widespread while strict adherence remains challenging, and shows that explicit reasoning capabilities can sometimes outperform sheer model scale. The benchmark is openly released to advance research in reliable, instruction-following code generation and constraint-aware evaluation.

Abstract

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
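The gap between partial and strict adherence can be illustrated with a minimal sketch. It assumes that partial adherence is the mean per-task fraction of satisfied constraints and that strict adherence is the fraction of tasks with every constraint satisfied; these are illustrative definitions, not necessarily the paper's exact formulas. The function names and the verdict data below are hypothetical.

```python
from typing import List


def partial_adherence(verdicts: List[List[bool]]) -> float:
    """Mean per-task fraction of satisfied constraints (assumed definition)."""
    scored = [v for v in verdicts if v]  # skip tasks with no constraints
    return sum(sum(v) / len(v) for v in scored) / len(scored)


def strict_adherence(verdicts: List[List[bool]]) -> float:
    """Fraction of tasks where every constraint is satisfied (assumed definition)."""
    scored = [v for v in verdicts if v]
    return sum(all(v) for v in scored) / len(scored)


# Hypothetical judge verdicts: four tasks with seven constraints each.
# Two tasks satisfy everything; two tasks each miss a single constraint.
verdicts = [
    [True] * 7,
    [True] * 6 + [False],
    [True] * 6 + [False],
    [True] * 7,
]

print(f"partial: {partial_adherence(verdicts):.3f}")  # partial: 0.929
print(f"strict:  {strict_adherence(verdicts):.3f}")   # strict:  0.500
```

Under these assumptions, one missed constraint out of seven barely lowers the partial score but removes the task from the strict count entirely, which is one way a model can sit above 90% partial adherence while its strict adherence is far lower.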

Paper Structure

This paper contains 31 sections, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Constraint Satisfaction Rate (CSR) across models. Larger models and those with reasoning capabilities demonstrate significantly higher adherence to developer-specified constraints. The reasoning-oriented o3-mini achieves the highest CSR, while smaller models show steep declines, underscoring the challenge of reliably satisfying multiple coding requirements.
  • Figure 2: Illustration of constraint adherence in a real-world task. The developer instruction (left) specifies four requirements; the model-generated code (center) satisfies only two. The right panel summarizes satisfied (✓) and violated (✗) constraints, showing that syntactically correct code may still fail to meet critical developer requirements.
  • Figure 3: Overview of the benchmark creation workflow, showing task sampling, constraint categorization and generation, quality validation, and final evaluation based on constraint adherence and functional correctness.
  • Figure 4: Examples of constraint categories and sample developer-style instructions.
  • Figure 5: Category-wise SSR comparison across models. Constraints related to security/privacy and optimization are consistently the hardest to follow.
  • ...and 17 more figures