PIPer: On-Device Environment Setup via Online Reinforcement Learning

Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov

TL;DR

PIPer tackles the environment setup problem in software engineering by training a small, on-device model to generate executable Bash scripts. It combines supervised fine-tuning via distillation from a larger model with Reinforcement Learning with Verifiable Rewards (RLVR), using a lightweight reward that mimics runtime evaluation. An LLM-as-a-Judge reward scores script quality and guides learning without containerized execution, enabling efficient on-device training. Across EnvBench-Python, Repo2Run, and Terminal-Bench, PIPer performs competitively with larger models such as GPT-4o and Qwen3-32B, while offering better cost-efficiency and demonstrating meaningful generalization beyond single-task scripts.

Abstract

Environment setup, the process of configuring the system to work with a specific software project, represents a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. This also helps SE researchers to scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts and Reinforcement Learning with Verifiable Rewards (RLVR) to adapt it to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models (Qwen3-32B and GPT-4o). The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.

Paper Structure

This paper contains 31 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed training pipeline. (a) SFT training: For the $i$-th sample (a repository), both teacher and student LLMs receive the prompt $q_i$, which includes the task description and repository context. They generate completions $o_i^t$ and $o_i^s$, respectively, expected to contain a shell script. The student model’s weights are updated by minimizing the cross-entropy loss between its output distribution and the teacher’s completion. (b) RL training: For each sample, LLM $\pi_\theta$ generates a completion $o_i$, expected to contain a shell script. The completion is evaluated by a rule-based reward function $R$, which outputs a score $R_i$. The REINFORCE++ algorithm then updates the LLM weights using the rewards $R_i$ and responses $o_i$.
  • Figure 1: Results on Repo2Run and Terminal-Bench for base models and our tuned Qwen3-8B. For Repo2Run, success is defined as a zero exit code with no test collection errors. For Terminal-Bench, success is determined by per-sample evaluation commands. Our PIPer model achieves the best performance on Repo2Run. However, SFT-based models underperform on Terminal-Bench's multi-turn setting.
  • Figure 2: RLVR training dynamics with the proxy rewards described in the section on proxy rewards. Raw datapoints are shown as semi-transparent dots, with Gaussian-smoothed curves overlaid to highlight trends. Blue shows average reward on the training set; orange shows average reward on the validation set. The x-axis is training steps, and the y-axis is average reward. Evolution of the LLM-as-a-Judge reward $R_{\text{LLM}}$: (a) over the base model, (b) over the SFT model.
  • Figure 3: Performance analysis of environment setup models on EnvBench-Python. (a) Pass@$N$ performance showing how model success rates improve with multiple attempts ($N=1$ to $5$). Our PIPer model (shown with cross markers) achieves performance comparable to much larger models like GPT-4o and Qwen3-32B, while substantially outperforming the base Qwen3-8B model. (b) Cost-performance tradeoff analysis comparing average pass@1 performance (averaged over five runs) against price per 1M output tokens (USD).
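
The two metrics named in the Figure 3 caption can be illustrated with a small sketch. It assumes that pass@$N$ counts a repository as solved when at least one of its first $N$ attempts succeeds, and that average pass@1 is the per-run success rate averaged over runs; the paper may compute these differently. The function names and the toy outcomes below are hypothetical and are not results from the paper.

```python
from statistics import mean

def pass_at_n(results, n):
    """Fraction of repositories solved by at least one of the first n attempts.

    `results` maps a repository name to a list of booleans, one per attempt.
    Assumption: this "any of the first n" definition, not necessarily the
    estimator used by the authors.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return mean(1.0 if any(attempts[:n]) else 0.0 for attempts in results.values())

def average_pass_at_1(results):
    """Success rate of each run (attempt index), averaged over all runs."""
    num_runs = min(len(attempts) for attempts in results.values())
    per_run = [
        mean(1.0 if attempts[run] else 0.0 for attempts in results.values())
        for run in range(num_runs)
    ]
    return mean(per_run)

# Hypothetical outcomes for four repositories, five attempts each.
toy_results = {
    "repo_a": [False, True, False, False, False],
    "repo_b": [False, False, False, False, False],
    "repo_c": [True, True, True, False, True],
    "repo_d": [False, False, False, True, False],
}

if __name__ == "__main__":
    for n in range(1, 6):
        print(f"pass@{n} = {pass_at_n(toy_results, n):.2f}")
    print(f"average pass@1 = {average_pass_at_1(toy_results):.2f}")
```

Under this definition pass@$N$ can only stay flat or rise as $N$ grows, which matches the caption's description of success rates improving with multiple attempts.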