Table of Contents
Fetching ...

PyBench: Evaluating LLM Agent on various real-world coding tasks

Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

TL;DR

<3-5 sentence high-level summary> PyBench addresses the lack of benchmarks for evaluating LLM Agents performing real-world coding tasks that involve interacting with files through a Python code interpreter. It defines a multi-turn evaluation framework with five task categories, automated unit tests, and an LLM-based evaluator to assess both correctness and efficiency. The authors synthesize four training datasets (including PyInstruct, CodeFeedback, CodeActInstruct, and UltraChat) and show that continued pretraining on code-rich data plus targeted fine-tuning yields PyLlama3, which outperforms many larger models on PyBench. The work demonstrates that real-world coding proficiency requires not only coding ability but also planning, multi-turn reasoning, and effective use of code feedback, offering a practical benchmark to advance usable LLM Agents for everyday programming tasks.

Abstract

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks. To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs are struggling with these tasks. Hence, we conduct analysis and experiments on four kinds of datasets proving that comprehensive abilities are needed for PyBench. Our fine-tuned 8B size model: \textbf{PyLlama3} achieves an exciting performance on PyBench which surpasses many 33B and 70B size models. Our Benchmark, Training Dataset, and Model are available at: {https://github.com/Mercury7353/PyBench}

PyBench: Evaluating LLM Agent on various real-world coding tasks

TL;DR

<3-5 sentence high-level summary> PyBench addresses the lack of benchmarks for evaluating LLM Agents performing real-world coding tasks that involve interacting with files through a Python code interpreter. It defines a multi-turn evaluation framework with five task categories, automated unit tests, and an LLM-based evaluator to assess both correctness and efficiency. The authors synthesize four training datasets (including PyInstruct, CodeFeedback, CodeActInstruct, and UltraChat) and show that continued pretraining on code-rich data plus targeted fine-tuning yields PyLlama3, which outperforms many larger models on PyBench. The work demonstrates that real-world coding proficiency requires not only coding ability but also planning, multi-turn reasoning, and effective use of code feedback, offering a practical benchmark to advance usable LLM Agents for everyday programming tasks.

Abstract

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks. To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs are struggling with these tasks. Hence, we conduct analysis and experiments on four kinds of datasets proving that comprehensive abilities are needed for PyBench. Our fine-tuned 8B size model: \textbf{PyLlama3} achieves an exciting performance on PyBench which surpasses many 33B and 70B size models. Our Benchmark, Training Dataset, and Model are available at: {https://github.com/Mercury7353/PyBench}
Paper Structure (37 sections, 1 equation, 11 figures, 7 tables)

This paper contains 37 sections, 1 equation, 11 figures, 7 tables.

Figures (11)

  • Figure 1: An Overview of LLMs' performance on PyBench
  • Figure 2: The construction and evaluation workflow of PyBench
  • Figure 3: Function Call vs. Our Code Interpreter Format
  • Figure 4: Generating Trajectory Data by ReAct
  • Figure 5: Prompt equipping LLM Agent with a code interpreter
  • ...and 6 more figures