PECC: Problem Extraction and Coding Challenges

Patrick Haller, Jonas Golde, Alan Akbik

TL;DR

PECC is a large-scale benchmark for code generation from narrative problems. It combines Advent Of Code (AoC) and Project Euler challenges into 2396 problems that demand prose understanding, problem extraction, and executable-code generation with explicit result assertion. The authors compare diverse models under several prompt formulations, including single-turn, multi-turn, and chain-of-thought prompting, and evaluate them with Pass@k and Pass@k-Difficulty metrics to reflect problem complexity. Key findings: narrative framing can help on certain AoC tasks but may hinder the math-focused Euler tasks, and chain-of-thought prompting frequently improves coding performance, although the benchmark remains difficult overall. The results also show a gap between commercial and open-source models. PECC provides an open dataset and evaluation framework to track progress toward universal problem-solving by LLMs in realistic, narrative-rich coding scenarios.
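
As an illustration of the Pass@k metric mentioned above, here is a minimal sketch of the unbiased pass@k estimator commonly used in the code-generation literature. This summary does not give the exact formula the authors use, nor the definition of Pass@k-Difficulty, so the function names and the (n, c) input format below are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled solutions, c of them correct.

    Standard unbiased estimator: 1 - C(n - c, k) / C(n, k).
    Assumption: this is the common literature definition, not necessarily
    the exact variant used by the benchmark.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is a hypothetical (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly k samples are drawn per problem (n = k), the estimate reduces to 1 if any sample is correct and 0 otherwise, so the mean is simply the fraction of problems solved within k attempts.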

Abstract

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark of 2396 problems derived from Advent Of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with particular difficulty on the math-based Euler subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.

Paper Structure (19 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: A schematic representation of the code generation and assertion process.
  • Figure 2: Contrasting problem descriptions from AoC. The left shows a narrative-style problem, rich in story and context; the right shows a neutral-style, succinctly distilled version. Each is paired with the solution generated by GPT-3.5-Turbo. The generated code is more concise for neutrally formulated problems, while solutions for narrative problems tend to model the story more closely.
  • Figure 3: Percentage of Euler problems solved by gpt-3.5-turbo, grouped by difficulty level. In the easiest category, correct solutions were obtained for roughly 60% of problems when coding and 80% when answering with chain-of-thought. As the difficulty level increases, the success rate drops rapidly. Scores for difficulty levels above 55 are not reported because gpt-3.5-turbo produced no correct answers there.
  • Figure 4: Comparing the accuracy with Pass@3 and Pass@3 + Difficulty for gpt-3.5-turbo and codechat-bison over the Euler and AoC subsets.
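
The code generation and assertion process named in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the helper name `run_and_check`, the choice of running the generated program in a Python subprocess, and the convention that the answer is the last line printed to stdout are all assumptions.

```python
import subprocess
import sys


def run_and_check(program: str, expected: str, timeout_s: float = 10.0) -> bool:
    """Run a generated Python program and compare its final output line to the expected answer.

    Hypothetical helper: assumes the program prints its answer as the last
    line of stdout. Crashes, timeouts, and empty output count as failures.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],  # execute in a separate interpreter
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    if proc.returncode != 0:
        return False
    lines = proc.stdout.strip().splitlines()
    return bool(lines) and lines[-1].strip() == expected.strip()


if __name__ == "__main__":
    # A stand-in for model output; a real run would use the LLM's generated code.
    print(run_and_check("print(sum(range(1, 11)))", "55"))
```

Running untrusted generated code in a subprocess with a timeout keeps a faulty or non-terminating solution from stalling the evaluation loop; a production setup would likely add stronger sandboxing.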