HumanEval on Latest GPT Models -- 2024

Daniel Li; Lincoln Murr

TL;DR

This work evaluates GPT-4's code-generation capabilities on the OpenAI HumanEval benchmark, examining the impact of prompt engineering versus native model abilities and the potential emergence of AGI-like behaviours in code synthesis. Using the $pass@1$ and $pass@10$ metrics, the study shows GPT-4 delivering substantial gains over prior models (e.g., $pass@1 \approx 85.7$, $pass@10 \approx 98.2$) and highlights the value and cost of advanced prompt strategies such as LATS and Reflexion. It also discusses the trade-offs of prompt engineering and the prospects for future LLMs to reduce reliance on prompts, and it contributes open-source code and a dataset to enable replication and further research. Collectively, the results inform the design of next-generation LLM-assisted code-generation systems and provide practical guidance for researchers evaluating and deploying these models.
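For readers unfamiliar with the metric, the following is a minimal sketch of the unbiased $pass@k$ estimator that was introduced alongside HumanEval: for a problem with $n$ sampled completions of which $c$ pass the unit tests, $pass@k = 1 - \binom{n-c}{k} / \binom{n}{k}$. Whether this study computes its scores in exactly this way is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative rather than taken from the authors' repository. The functions return fractions in $[0, 1]$; the figures quoted above appear to be percentages.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass all unit tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem ($n = k = 1$), the estimate reduces to the fraction of problems whose one completion passes, which is the usual reading of $pass@1$.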

Abstract

In 2023, we use the latest GPT-4 models to advance program synthesis. Large language models have significantly improved the state of the art for this task. To make these advances more accessible, we have created a repository that connects these models to HumanEval. This dataset was initially developed for use with a language model called CODEGEN on natural and programming language data. We showcase the utility of these models by demonstrating competitive zero-shot Python code generation on HumanEval tasks compared with previous state-of-the-art solutions. Additionally, this opens the way to developing multi-step synthesis paradigms. The benchmark features 160 diverse problem sets factorized into multi-step prompts, which our analysis shows significantly improve program synthesis over single-turn inputs. All code is open source at https://github.com/daniel442li/gpt-human-eval .
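To illustrate what "passing" a HumanEval task means, here is a hedged sketch of a functional-correctness check. It relies on the public dataset's record layout (`prompt`, `test`, and `entry_point` fields, with the test defining a `check(candidate)` function). The helper `passes_tests` and the toy problem are hypothetical and are not taken from the authors' repository. A real harness runs untrusted model output in a sandboxed subprocess with a timeout; the bare `exec` here is for illustration only.

```python
def passes_tests(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests.

    Illustration only: exec() on model output is unsafe outside a sandbox.
    """
    program = (
        problem["prompt"]                         # signature and docstring
        + completion                              # model-generated function body
        + "\n\n"
        + problem["test"]                         # defines check(candidate)
        + "\n\n"
        + f"check({problem['entry_point']})\n"    # run the tests on the function
    )
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a fail.
        return False
    return True


# Hypothetical toy problem in the HumanEval record layout (not from the dataset).
toy_problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

good = passes_tests(toy_problem, "    return a + b\n")
bad = passes_tests(toy_problem, "    return a - b\n")
```

In a zero-shot setting, the model sees only the `prompt` text, and each sampled completion is scored with a check of this kind before the per-problem pass counts are aggregated.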

Paper Structure (28 sections, 1 table)

This paper contains 28 sections, 1 table.

Introduction
Background on LLMs
Prompt Engineering
Research Goal: Prompt Engineering vs Native Abilities
Significance of the Study
Objectives of the Study
Key Contributions
Methodology
Goals
Sub-Research Questions
Evaluation Set
Experiment Design
Ability for Models to Pass HumanEval
Method
Evaluation Criteria
...and 13 more sections
