CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion
Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, Dandan Tu
TL;DR
CLI-Gym tackles the challenge of scaling environment-intensive agentic coding by representing runtimes as Dockerfile-driven environments and inverting the histories of healthy environments to synthesize buggy task states. It builds 1,655 CLI tasks from 29 repositories and curates 291 high-quality repair trajectories, which are used to fine-tune LiberCoder models that achieve strong Terminal-Bench performance, including 46.1% pass@1 on v1.0. The approach demonstrates that environment-centric supervision and diverse, automatically generated tasks can close the gap with closed models, and it offers a public data pipeline for scalable CLI task generation. The work highlights the value of combining environment inversion with open-source trajectories to advance agentic CLI proficiency, with practical relevance to real-world DevOps and system troubleshooting.
Abstract
Agentic coding requires agents to interact effectively with runtime environments, such as command-line interfaces (CLIs), to complete tasks like resolving dependency issues and fixing system problems. However, how to obtain such environment-intensive tasks at scale to enhance agents' capabilities remains underexplored. To address this, drawing on an analogy between the Dockerfile and the agentic task, we propose employing agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, forming the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
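The inversion idea in the abstract can be illustrated with a toy sketch. This is not the CLI-Gym pipeline: the paper uses agents guided by real execution feedback, whereas the sketch below models an environment as a plain package-to-version mapping and uses a hand-written list of breaking steps. The names `Task`, `run_checks`, `invert`, `render_dockerfile`, `derive_task`, and the image tag `healthy-env:latest` are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record: a buggy environment plus the errors it produces."""
    dockerfile: str
    error_message: str
    description: str


def run_checks(installed: dict, required: dict) -> list:
    """Toy stand-in for execution feedback: list every unmet requirement."""
    errors = []
    for pkg, version in sorted(required.items()):
        if pkg not in installed:
            errors.append(f"ModuleNotFoundError: No module named '{pkg}'")
        elif installed[pkg] != version:
            errors.append(f"VersionConflict: {pkg} {installed[pkg]} != {version}")
    return errors


def invert(healthy: dict, breaking_steps: list) -> dict:
    """Apply breaking steps to a copy of the healthy state to get a buggy one.

    Each step is (package, version); version None means "uninstall".
    """
    state = dict(healthy)
    for pkg, version in breaking_steps:
        if version is None:
            state.pop(pkg, None)
        else:
            state[pkg] = version
    return state


def render_dockerfile(base_image: str, breaking_steps: list) -> str:
    """Express the breaking steps as Dockerfile layers on top of the healthy image."""
    lines = [f"FROM {base_image}"]
    for pkg, version in breaking_steps:
        if version is None:
            lines.append(f"RUN pip uninstall -y {pkg}")
        else:
            lines.append(f"RUN pip install {pkg}=={version}")
    return "\n".join(lines)


def derive_task(base_image: str, healthy: dict, required: dict,
                breaking_steps: list) -> Optional[Task]:
    """Pack the buggy state and its error messages into a task, if anything broke."""
    buggy = invert(healthy, breaking_steps)
    errors = run_checks(buggy, required)
    if not errors:
        return None  # the steps caused no failure, so there is nothing to repair
    return Task(
        dockerfile=render_dockerfile(base_image, breaking_steps),
        error_message="\n".join(errors),
        description="Repair the environment so that all checks pass.",
    )


healthy_env = {"numpy": "1.26.4", "requests": "2.31.0"}
steps = [("numpy", None), ("requests", "2.0.0")]
task = derive_task("healthy-env:latest", healthy_env, healthy_env, steps)
```

In this sketch the task is only emitted when the checks actually fail, mirroring the abstract's point that a task is derived from a state with runtime failures together with its error messages.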
