TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models

Florian Tambon, Amin Nikanjam, Cyrine Zid, Foutse Khomh, Giuliano Antoniol

TL;DR

TaskEval introduces an IRT-based, multi-prompt framework to assess the intrinsic difficulty and discriminability of code-generation tasks for large language models. By generating 18 prompts per task through context-level transformations and rephrasing, and evaluating 8 CodeLLMs across multiple seeds, the approach yields per-task difficulty and discriminant estimates that reveal nuanced benchmark characteristics beyond aggregate accuracy. Across HumanEval+ and ClassEval, TaskEval uncovers topic- and construct-level patterns, demonstrates that human judgments diverge from CodeLLM-driven difficulty, and emphasizes the importance of prompt diversity for fair benchmarking. The framework offers a scalable, robust tool for diagnosing benchmark shortcomings and guiding the design of more informative evaluations for CodeLLMs. This work lays groundwork for adaptive benchmarking and targeted model improvement via task-centric insights.

Abstract

Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single prompt, even though the formulation of a prompt has a profound impact on the outcome. This paper introduces TaskEval, a general framework that uses diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and the characteristics of benchmark tasks, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse the tasks of 17 and 21 topics within the benchmarks. We also cross-analyse tasks' characteristics with the programming constructs (e.g., variable assignment, conditions) used by LLMs, highlighting some patterns related to tasks' difficulty. Finally, we compare the difficulty assessments of tasks made by human annotators with those derived from LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in fostering better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks or improve the evaluation of LLMs.

Paper Structure

This paper contains 23 sections, 4 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of TaskEval Framework.
  • Figure 2: ICC between ability $\theta_i$ of a CodeLLM and the expected response $\mathop{\mathbb{E}}[p_{ij} | \theta_i, \delta_j, a_j]$ for four tasks with different $\delta_j$ and $a_j$ values.
  • Figure 3: Proportion of tasks in HumanEval+ (Top) and ClassEval (Bottom) with a given expected probability of response for each CodeLLM.
  • Figure 4: Maps of the difficulty $\delta_j$ vs discriminant $a_j$ of each task for a given benchmark. The colour represents the expected probability obtained on a given task by a hypothetical CodeLLM whose capacity $\overline{\theta}$ is the average of the capacity of our CodeLLMs: (Left) HumanEval+ tasks, (Right) ClassEval tasks.
  • Figure 5: Mean per topic of: 1) the accuracy on the tasks' prompts for each CodeLLM, 2) the tasks' difficulty (Diff), 3) the tasks' discriminant (Disc). (Top) HumanEval+, (Bottom) ClassEval.
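
The item characteristic curves (ICC) described in the caption of Figure 2 can be illustrated with a minimal sketch. It assumes the standard two-parameter logistic (2PL) IRT form, which may differ from the exact model fitted in the paper. The function name `expected_response` and the four $(\delta_j, a_j)$ pairs are hypothetical and chosen only to show how difficulty shifts a curve and how the discriminant changes its steepness.

```python
import math


def expected_response(theta: float, delta: float, a: float) -> float:
    """Expected response under an ASSUMED 2PL logistic model.

    theta: ability of a CodeLLM (theta_i in the Figure 2 caption)
    delta: difficulty of a task (delta_j)
    a:     discriminant of a task (a_j)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))


# Hypothetical tasks as (delta_j, a_j) pairs; these are NOT values from the paper.
tasks = {
    "easy, weakly discriminating": (-1.0, 0.5),
    "easy, strongly discriminating": (-1.0, 2.0),
    "hard, weakly discriminating": (1.0, 0.5),
    "hard, strongly discriminating": (1.0, 2.0),
}

# Evaluate each curve on a small grid of abilities.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
for name, (delta, a) in tasks.items():
    curve = [round(expected_response(t, delta, a), 3) for t in abilities]
    print(f"{name:32s} {curve}")
```

Under this assumed form, the expected response equals 0.5 when $\theta_i = \delta_j$, so a larger $\delta_j$ moves the curve to the right, while a larger $a_j$ makes the transition around $\delta_j$ sharper and therefore separates weaker from stronger models more clearly.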