TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
Florian Tambon, Amin Nikanjam, Cyrine Zid, Foutse Khomh, Giuliano Antoniol
TL;DR
TaskEval introduces an IRT-based, multi-prompt framework to assess the intrinsic difficulty and discriminability of code-generation tasks for large language models. By generating 18 prompts per task through context-level transformations and rephrasing, and evaluating 8 CodeLLMs across multiple seeds, the approach yields per-task difficulty and discriminant estimates that reveal nuanced benchmark characteristics beyond aggregate accuracy. Across HumanEval+ and ClassEval, TaskEval uncovers topic- and construct-level patterns, demonstrates that human judgments diverge from CodeLLM-driven difficulty, and emphasizes the importance of prompt diversity for fair benchmarking. The framework offers a scalable, robust tool for diagnosing benchmark shortcomings and guiding the design of more informative evaluations for CodeLLMs. This work lays groundwork for adaptive benchmarking and targeted model improvement via task-centric insights.
Abstract
Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, even though the formulation of a prompt has a profound impact on the outcome. This paper introduces TaskEval, a generalist framework that uses diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and the characteristics of benchmark tasks, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify 17 and 21 topics within the two benchmarks, respectively, and analyse the tasks belonging to them. We also cross-analyse tasks' characteristics with the programming constructs (e.g., variable assignment, conditions, etc.) used by LLMs, highlighting some patterns related to tasks' difficulty. Finally, we compare the difficulty assessments of tasks made by human annotators with those derived from LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in fostering better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks or to improve the evaluation of LLMs.
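
To make the IRT idea concrete, here is a minimal sketch of how a per-task difficulty and discriminant could be estimated from pass/fail outcomes. It is not the paper's implementation. The abstract does not state which IRT variant TaskEval uses, so the sketch assumes the standard two-parameter logistic (2PL) model, in which the probability of success is a sigmoid of discrimination × (ability − difficulty). The function names `p_correct` and `fit_item`, the fixed ability values, and the toy outcomes are all hypothetical and chosen only for illustration.

```python
import math


def p_correct(ability, difficulty, discrimination):
    """2PL item response function: sigmoid(a * (theta - b))."""
    z = discrimination * (ability - difficulty)
    return 1.0 / (1.0 + math.exp(-z))


def fit_item(abilities, outcomes, steps=2000, lr=0.05):
    """Estimate (difficulty, discrimination) for a single task.

    Assumption: respondent abilities are already known and held fixed,
    so only the two item parameters are fitted, by plain gradient
    ascent on the Bernoulli log-likelihood.
    """
    b, a = 0.0, 1.0  # initial difficulty and discrimination
    n = len(abilities)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for theta, y in zip(abilities, outcomes):
            residual = y - p_correct(theta, b, a)  # dL/dz for z = a*(theta-b)
            grad_a += residual * (theta - b)       # dz/da = theta - b
            grad_b += residual * (-a)              # dz/db = -a
        a += lr * grad_a / n
        b += lr * grad_b / n
    return b, a


# Toy data (hypothetical): ten respondents with fixed abilities,
# 1 = generated code passed the tests, 0 = it failed.
abilities = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
easy_task = [0, 1, 1, 1, 1, 1, 0, 1, 1, 1]
hard_task = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

b_easy, a_easy = fit_item(abilities, easy_task)
b_hard, a_hard = fit_item(abilities, hard_task)
print(f"easy task: difficulty={b_easy:.2f}, discrimination={a_easy:.2f}")
print(f"hard task: difficulty={b_hard:.2f}, discrimination={a_hard:.2f}")
```

In this toy setup, the task that most respondents solve receives a lower difficulty estimate than the task that only high-ability respondents solve. A full IRT fit would estimate abilities and item parameters jointly, which this sketch deliberately leaves out.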
