PIPer: On-Device Environment Setup via Online Reinforcement Learning
Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov
TL;DR
PIPer tackles the environment setup problem in software engineering by training a small, on-device model to generate executable Bash scripts. It combines supervised fine-tuning via distillation from a larger model with Reinforcement Learning with Verifiable Rewards (RLVR), using a lightweight reward that mimics runtime evaluation. This LLM-as-Judge reward scores script quality and guides learning without containerized execution, which keeps training efficient. Across EnvBench-Python, Repo2Run, and Terminal-Bench, PIPer performs competitively with larger models such as GPT-4o and Qwen3-32B, while offering better cost-efficiency and demonstrating meaningful generalization beyond single-task scripts.
Abstract
Environment setup, the process of configuring a system to work with a specific software project, remains a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. They would also help SE researchers scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts with Reinforcement Learning with Verifiable Rewards (RLVR) to adapt the model to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with the larger models Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.
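To make the idea of a reward computed without executing the script more concrete, the following is a minimal Python sketch. It is not the reward used in PIPer: the function names `extract_script` and `toy_reward`, the 0.5 format bonus, and the `judge` callable (a stand-in for an LLM-as-Judge returning a score in [0, 1]) are all assumptions made for illustration.

```python
import re
from typing import Callable, Optional

TICKS = "`" * 3  # fence marker, built programmatically to keep this block readable

# Matches the first fenced bash/sh block in a model completion.
FENCE = re.compile(TICKS + r"(?:bash|sh)\s*\n(.*?)" + TICKS, re.DOTALL)


def extract_script(completion: str) -> Optional[str]:
    """Return the body of the first fenced bash block, or None if there is none."""
    match = FENCE.search(completion)
    return match.group(1).strip() if match else None


def toy_reward(completion: str,
               judge: Optional[Callable[[str], float]] = None) -> float:
    """Hypothetical execution-free reward in [0, 1].

    0.0  -> no usable script in the completion (format failure)
    0.5  -> a script was found, but no judge is available
    else -> 0.5 plus half of the judge score, clamped to [0, 1]
    """
    script = extract_script(completion)
    if not script:
        return 0.0
    if judge is None:
        return 0.5
    score = min(max(judge(script), 0.0), 1.0)  # clamp an out-of-range judge score
    return 0.5 + 0.5 * score


if __name__ == "__main__":
    completion = "Setup script:\n" + TICKS + "bash\npip install -e .\n" + TICKS
    # A trivial stand-in judge; a real one would query an LLM.
    print(toy_reward(completion, judge=lambda s: 1.0 if "pip install" in s else 0.0))
```

Nothing in this sketch runs the generated script or starts a container, which is the property that makes such a reward cheap to compute during training. How well it tracks real setup success depends entirely on the quality of the judge.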
