HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Zhaojian Yu; Yilun Zhao; Arman Cohan; Xiao-Ping Zhang

TL;DR

Self-invoking code generation benchmarks extend traditional code-generation tasks by requiring LLMs to solve a base problem and then reuse their own generated code to tackle a related, more complex problem. The authors present a scalable benchmark construction recipe and validate it across more than 20 LLMs, revealing a substantial performance drop on self-invoking tasks and only marginal gains from instruction-tuning. They analyze failure modes (e.g., AssertionError, NameError) and demonstrate that chain-of-thought prompting can improve some results, highlighting gaps in current models' autonomous code reasoning. The work introduces HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro and provides a foundation to drive future improvements in program reasoning and self-referential code generation. This has practical significance for advancing robust, autonomous coding assistants capable of deeper reasoning and self-improvement.

Abstract

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.
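
To make the task format concrete, the following is a minimal sketch of what a base problem and its self-invoking counterpart might look like. The two functions (sort_numbers, sort_rows) and their tests are hypothetical illustrations, not items taken from HumanEval Pro or MBPP Pro.

```python
from typing import List


# Hypothetical base problem: sort a list of integers in ascending order.
def sort_numbers(values: List[int]) -> List[int]:
    return sorted(values)


# Hypothetical self-invoking problem: sort every row of a matrix.
# The solution is expected to call the base solution rather than re-solve it.
def sort_rows(matrix: List[List[int]]) -> List[List[int]]:
    return [sort_numbers(row) for row in matrix]


# Assert-style test cases, in the spirit of the benchmarks' test format.
assert sort_numbers([3, 1, 2]) == [1, 2, 3]
assert sort_rows([[3, 1], [9, 7, 8]]) == [[1, 3], [7, 8, 9]]
```

In this setup, a model that writes sort_rows but omits or renames sort_numbers would hit a NameError at call time, and one that misuses the base solution would fail an assertion, which matches the kinds of failure modes the paper reports.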

Paper Structure (34 sections, 1 equation, 10 figures, 9 tables)

Introduction
Related Work
Benchmarks for Code Generation
LLMs for Code Generation
Benchmark Construction
Self-invoking Problem Generation
Solution Generation
Test Cases Generation
Experiments
Analysis
Base Model vs Instruct Model
Confusion Matrix Correlation for Different Models
Chain-of-Thought Prompting
Error Analysis
Generalization Study of Self-invoking Code Generation
...and 19 more sections

Figures (10)

Figure 1: The overview of self-invoking code generation in HumanEval Pro and MBPP Pro. Given a base problem and a related, more complex problem, LLMs are required to solve the base problem and use its solution to address the more complex problem.
Figure 2: The overview of benchmark construction. An example is shown in a separate figure (fig:bc_exp). We summarize the entire benchmark construction process as follows: (1) Self-invoking Problem Generation: We use Deepseek-V2.5 to generate the self-invoking problems, as well as their candidate solutions and test inputs. (2) Solution Generation: We execute the generated solution with the test inputs in a controlled Python environment to obtain ground truth outputs. (3) Test Cases Generation: We employ an iterative method involving Python execution checks and manual review to ensure that all test cases pass successfully. The final execution results are then used to construct complete test cases with assert statements.
Figure 3: Performance Comparison: HumanEval Pro (and MBPP Pro) vs. HumanEval (and MBPP).
Figure 4: HumanEval (or MBPP) scores against the results on HumanEval Pro and MBPP Pro (HumanEval+ and MBPP+). We present the comparison between base models and instruct models.
Figure 5: The confusion matrix of different models. We use (Failed, Passed) to indicate samples that fail in HumanEval Pro (or MBPP Pro) but pass in HumanEval (or MBPP).
...and 5 more figures
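
The construction steps summarized in the Figure 2 caption (run a candidate solution on generated test inputs, then turn the observed outputs into assert-based test cases) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: build_assert_tests, count_vowels, and count_vowels_in_all are hypothetical names, and inputs that raise are simply skipped here, whereas the paper describes an iterative execution check with manual review.

```python
from typing import Any, Callable, List, Tuple


def build_assert_tests(
    solution: Callable[..., Any],
    func_name: str,
    test_inputs: List[Tuple[Any, ...]],
) -> List[str]:
    """Run the candidate solution on each input and emit one assert line per input."""
    tests = []
    for args in test_inputs:
        try:
            expected = solution(*args)  # observed output is treated as ground truth
        except Exception:
            # Simplification for this sketch: drop inputs the solution cannot handle.
            continue
        arg_text = ", ".join(repr(a) for a in args)
        tests.append(f"assert {func_name}({arg_text}) == {expected!r}")
    return tests


# Hypothetical base solution and self-invoking solution.
def count_vowels(text: str) -> int:
    return sum(ch in "aeiou" for ch in text.lower())


def count_vowels_in_all(texts: List[str]) -> List[int]:
    return [count_vowels(t) for t in texts]


tests = build_assert_tests(
    count_vowels_in_all,
    "count_vowels_in_all",
    [(["apple", "sky"],), ([],)],
)
for line in tests:
    print(line)
```

Generating the expected values by execution rather than by the model keeps the asserts consistent with the reference solution, though it also means any bug in that solution is baked into the tests, which is presumably why a manual review step is part of the described process.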
