CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi
TL;DR
The paper addresses evaluating LLM-based code generation beyond functional correctness by introducing CodeAlignBench, a developer-instruction-focused, multi-language benchmark with Follow-up and Predefined instruction settings. It builds an instruction catalog from real developer preferences via a cross-language user study and an automated taxonomy, then assembles and verifies instruction-following tasks with an extensible evaluation framework. Evaluations on LiveBench-derived problems translated to Python, Java, and JavaScript reveal significant variation by language and instruction type, with follow-up refinements generally easier and structural changes often yielding higher accuracy; GPT-family models frequently outperform others, yet no model saturates across all tasks. The work delivers a scalable, realistic platform for developer-centered code evaluation and paves the way for richer multi-turn instruction scenarios in future research.
Abstract
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
