Table of Contents
Fetching ...

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

TL;DR

The paper addresses evaluating LLM-based code generation beyond functional correctness by introducing CodeAlignBench, a developer-instruction-focused, multi-language benchmark with Follow-up and Predefined instruction settings. It builds an instruction catalog from real developer preferences via a cross-language user study and an automated taxonomy, then assembles and verifies instruction-following tasks with an extensible evaluation framework. Evaluations on LiveBench-derived problems translated to Python, Java, and JavaScript reveal significant variation by language and instruction type, with follow-up refinements generally easier and structural changes often yielding higher accuracy; GPT-family models frequently outperform others, yet no model saturates across all tasks. The work delivers a scalable, realistic platform for developer-centered code evaluation and paves the way for richer multi-turn instruction scenarios in future research.

Abstract

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

TL;DR

The paper addresses evaluating LLM-based code generation beyond functional correctness by introducing CodeAlignBench, a developer-instruction-focused, multi-language benchmark with Follow-up and Predefined instruction settings. It builds an instruction catalog from real developer preferences via a cross-language user study and an automated taxonomy, then assembles and verifies instruction-following tasks with an extensible evaluation framework. Evaluations on LiveBench-derived problems translated to Python, Java, and JavaScript reveal significant variation by language and instruction type, with follow-up refinements generally easier and structural changes often yielding higher accuracy; GPT-family models frequently outperform others, yet no model saturates across all tasks. The work delivers a scalable, realistic platform for developer-centered code evaluation and paves the way for richer multi-turn instruction scenarios in future research.

Abstract

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.

Paper Structure

This paper contains 18 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Illustration of two instruction settings in CodeAlignBench : (a) Follow-up Instructions, where additional instructions are provided after an initial code generation.(b) Predefined Instructions, where developer constraint is embedded in the initial prompt.
  • Figure 2: LLM-Assisted Coding Procedure. Stage 1: Manual open coding to create an initial codebook. Stage 2: Exemplar-based prompting of an LLM to generate codes at scale. Stage 3: Alignment and consolidation of LLM- and human-created codebooks. Stage 4: Evaluation of LLM coding reliability against human labels using inter-rater agreement.
  • Figure 3: Instruction-following benchmarking framework for code generation
  • Figure 4: Radar plots of the top models from each family (GPT, Gemini, and Sonnet), showing performance across instruction categories (Structural, Semantic, and Cosmetic) as well as the overall aggregate. The top panels correspond to predefined tasks, while the bottom panel presents follow-up tasks. Shaded regions represent the standard error of the mean (SEM).
  • Figure 5: Pairwise evaluation task
  • ...and 4 more figures