RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose

TL;DR

RepoST addresses scalable repo-level code generation by constructing sandboxed environments that provide execution feedback without building entire repositories. It presents an automated pipeline for repository curation, sandboxing, test generation, and quality control, which supports large-scale training (RepoST-Train) and evaluation (RepoST-Eval). Training with RepoST-Train yields gains of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. RepoST-Eval benchmarks 12 models; GPT-4o achieves 39.53% Pass@1, indicating substantial room for improvement. The sandboxing approach, coupled with automated quality checks and dependency abstraction, supports contamination-free, scalable, live benchmarks for repo-level code generation.

Abstract

We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies in a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
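
To make the idea of sandbox testing concrete, here is a minimal sketch of what a standalone evaluation script could look like: a target function copied unchanged, a mock standing in for an external dependency, and a small test harness. All names here (`MockClient`, `score_answer`, `run_tests`) and the prompt format are hypothetical illustrations, not taken from RepoST.

```python
# Hypothetical sandboxed evaluation script; every name below is illustrative.


class MockClient:
    """Stand-in for an external API client (assumption: the real client
    exposes a `complete(prompt)` method). Returns canned replies and
    records each prompt so tests can inspect the calls."""

    def __init__(self, canned):
        self.canned = canned
        self.calls = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.canned.get(prompt, "")


def score_answer(client, question: str) -> int:
    """Target function, copied as-is from the (imagined) repository.
    It only needs `client.complete`, so the mock can replace the real API."""
    reply = client.complete(f"Rate from 1 to 5: {question}")
    return int(reply.strip())


def run_tests(target) -> bool:
    """Run the tests against any candidate implementation of the target.
    Returns True only if every assertion passes without an exception."""
    prompt = "Rate from 1 to 5: Is water wet?"
    client = MockClient({prompt: " 4 "})
    try:
        assert target(client, "Is water wet?") == 4
        assert client.calls == [prompt]
    except Exception:
        return False
    return True


if __name__ == "__main__":
    print("reference passes:", run_tests(score_answer))
```

Because the script carries its own mock and tests, it can run without installing or building the surrounding repository, which is the property the abstract attributes to sandbox testing.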

Paper Structure

This paper contains 28 sections, 10 figures, 19 tables.

Figures (10)

  • Figure 1: We can use the coding environments built by RepoST for training and evaluation. We first apply the code generation model to generate candidate solutions with the original repository as context. Then we evaluate the solutions by executing the evaluation script built by RepoST. For evaluation, we directly compute Pass@$k$ scores. For training, we add all successful solutions to the train set and further finetune the model.
  • Figure 2: The RepoST coding environment construction framework. We sandbox the target function and its dependencies to a separate evaluation script for execution, which avoids building the entire repository. We design careful quality control strategies with iterative quality improvement and post-filtering. The outcome of RepoST is a set of executable repo-level coding environments, which can be used for training and evaluation.
  • Figure 3: An example where the LLM successfully creates a mock class, Mock_API, to replace real external API calls. This enables us to execute the target function API_call, which remains exactly the same as in the original codebase, without making real API calls.
  • Figure 4: (a) Pass@1 scores on RepoEval with different numbers of training examples. (b) Pass@1 scores on RepoEval with different methods to sample 2,000 training examples. Sample-by-Example has a broader repository coverage and achieves better Pass@1. The performance is further enhanced with Rejection Sampling (Distill).
  • Figure 5: Case study 1. The original score_explicit_question function and its context extracted from the original GitHub repository. The function calls the text completion function from the OpenAI API.
  • ...and 5 more figures
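
The evaluation and training uses described in the Figure 1 caption can be sketched in a few lines. The source does not state how Pass@$k$ is estimated; the sketch below assumes the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ sampled solutions of which $c$ pass. The function names and the `passes_tests` callback are hypothetical.

```python
from math import comb
from typing import Callable, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Assumed unbiased Pass@k estimate for one problem:
    n = solutions sampled, c = solutions that passed, k = budget."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def keep_successful(
    candidates: List[str], passes_tests: Callable[[str], bool]
) -> List[str]:
    """Training-side filter: keep only candidates whose execution of the
    evaluation script succeeds (passes_tests is a hypothetical callback)."""
    return [solution for solution in candidates if passes_tests(solution)]
```

For evaluation, `pass_at_k` would be averaged over all problems; for training, the output of `keep_successful` would be added to the fine-tuning data.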