RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing
Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose
TL;DR
RepoST addresses the difficulty of constructing executable environments for repo-level code generation at scale: it uses sandbox testing to provide execution feedback without building entire repositories. It presents an automated pipeline for repository curation, sandboxing, test generation, and quality control, which is used to build a large-scale training set (RepoST-Train) and an evaluation benchmark (RepoST-Eval). Training with RepoST-Train improves Pass@1 by 5.5% on HumanEval and 3.5% on RepoEval. RepoST-Eval is used to benchmark 12 models, with GPT-4o achieving 39.53% Pass@1, indicating substantial room for improvement. The sandboxing approach, coupled with automated quality checks and dependency abstraction, supports contamination-free, scalable, live benchmarks for repo-level code generation.
Abstract
We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
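To make the idea of sandbox testing concrete, the following is a minimal sketch of isolating a target function and its local dependencies into a standalone script and running tests against it. It is an illustration only, not the RepoST pipeline: the dependency discovery here is a simple static call-graph walk with Python's `ast` module, which is an assumption made for this example, and the names `REPO_SOURCE`, `build_sandbox`, `run_sandbox`, `slugify`, `normalize`, and `unrelated` are all hypothetical.

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical repository module: `slugify` is the target function,
# `normalize` is its only local dependency, and `unrelated` stands in
# for the rest of the repository that the sandbox should not need.
REPO_SOURCE = '''
import re

def normalize(text):
    return re.sub(" +", " ", text).strip().lower()

def slugify(title):
    return normalize(title).replace(" ", "-")

def unrelated():
    import some_heavy_dependency  # never required by the target
'''


def build_sandbox(source: str, target: str) -> str:
    """Copy `target` and the top-level functions it calls into one script."""
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    imports = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    if target not in functions:
        raise KeyError(f"{target!r} is not a top-level function")

    # Walk direct calls by name, transitively, starting from the target.
    seen, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in seen or name not in functions:
            continue
        seen.add(name)
        for node in ast.walk(functions[name]):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                stack.append(node.func.id)

    # Keep module-level imports and the needed functions in original order.
    kept = imports + [fn for name, fn in functions.items() if name in seen]
    return "\n\n".join(ast.unparse(node) for node in kept)


def run_sandbox(script: str, tests: str) -> bool:
    """Run the sandbox script plus its tests in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sandbox.py"
        path.write_text(script + "\n\n" + tests + "\n")
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=30,
        )
    return result.returncode == 0


if __name__ == "__main__":
    sandbox = build_sandbox(REPO_SOURCE, "slugify")
    tests = 'assert slugify("  Hello   World ") == "hello-world"'
    print("tests passed:", run_sandbox(sandbox, tests))
```

Because `unrelated` is never copied into the sandbox, its heavy import does not have to be installed for the tests to run, which is the sense in which isolating the target reduces the burden of external dependencies. In a training or evaluation loop, a model-generated body for the target function would replace the original one in the sandbox script, and the boolean returned by `run_sandbox` would serve as the execution feedback.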
