CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, Dandan Tu

TL;DR

CLI-Gym tackles the challenge of scaling environment-intensive agentic coding by representing runtimes as Dockerfile-driven environments and inverting the histories of healthy environments to synthesize buggy task states. It builds 1,655 CLI tasks from 29 repositories and curates 291 high-quality repair trajectories, which are used to fine-tune the LiberCoder models; these reach 46.1% pass@1 on Terminal-Bench 1.0. The results suggest that environment-centric supervision and diverse, automatically generated tasks can narrow the gap with closed models, and the work offers a public data pipeline for scalable CLI task generation. It highlights the value of combining environment inversion with open-source trajectories to advance agentic CLI proficiency and points to practical impact in real-world DevOps and system-troubleshooting contexts.

Abstract

Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing histories of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, being the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves substantial absolute improvements of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
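To make the inversion idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the healthy ("gold") environment is available as Dockerfile text and that an agent has already produced a list of failure-inducing shell commands. The names `TaskInstance`, `invert_environment`, and `derive_task` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """Hypothetical container for one derived task (not the paper's schema)."""
    dockerfile: str          # buggy environment, expressed as a Dockerfile
    problem_statement: str   # assembled from the observed error messages
    failing_tests: List[str] = field(default_factory=list)


def invert_environment(gold_dockerfile: str, breaking_commands: List[str]) -> str:
    """Append failure-inducing commands as RUN layers on top of the gold build."""
    lines = [gold_dockerfile.rstrip()]
    lines += [f"RUN {cmd}" for cmd in breaking_commands]
    return "\n".join(lines) + "\n"


def derive_task(
    gold_dockerfile: str,
    breaking_commands: List[str],
    error_messages: List[str],
    failing_tests: List[str],
) -> TaskInstance:
    """Pack the buggy state together with its error messages into a task."""
    buggy = invert_environment(gold_dockerfile, breaking_commands)
    statement = "The environment fails with:\n" + "\n".join(
        f"- {msg}" for msg in error_messages
    )
    return TaskInstance(buggy, statement, list(failing_tests))
```

In this sketch the buggy state is the gold Dockerfile plus extra `RUN` layers, so solving the task amounts to undoing their effect inside the container. How the real pipeline stores and validates these states is not specified in the abstract.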

Paper Structure

This paper contains 22 sections, 4 equations, 21 figures, and 9 tables.

Figures (21)

  • Figure 1: Illustration of the idea behind CLI-Gym, which brings high performance on Terminal-Bench 1.0. (a): Code-intensive tasks, such as those in SWE-bench, can be derived from readily available code histories and context like PRs. For tasks involving intensive interaction with the environment, such as the CLI tasks in Terminal-Bench, we employ agents to simulate and explore environment histories guided by execution feedback, realizing scalable derivation of environment-intensive tasks. (b): With task trajectories obtained using CLI-Gym, the fine-tuned Qwen3-32B and Qwen3-235B-A22B-Instruct models, named LiberCoder and denoted by red triangles, achieve Pass@1 of 38.9% and 46.1%, respectively, outperforming various strong baselines.
  • Figure 2: Overview of our proposed CLI-Gym pipeline. 1) Starting from a GitHub repository, we construct a gold instance consisting of a functional environment, codebase, and associated unit tests. 2) We then derive task prompts from the unit tests and execute them with an agent to obtain failure-inducing commands. Based on the observed execution commands and failing tests, we automatically generate a corresponding problem statement. 3) Finally, the outputs from the previous steps are assembled into a standardized task instance.
  • Figure 2: Statistics comparing CLI-Gym with Terminal-Bench 1.0 and 2.0. Except for size and cost metrics, we report the average value across instances. $^\dagger$ 229 instances are composed of some non-evaluation tasks and 1.0 / 2.0 test tasks.
  • Figure 3: Category distribution of problem instances we generated using CLI-Gym.
  • Figure 4: A simplified example Dockerfile snippet that induces failures in a gold pandas environment by corrupting system libraries. The agent overwrites ELF headers of critical shared libraries (libsqlite3 and libz), inducing ImportError and failures of basic Linux commands that require system-level diagnosis beyond code repair.
  • ...and 16 more figures
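
The failure mode described in the Figure 4 caption (overwritten ELF headers in shared libraries) is one that a diagnosing agent could check for directly. The sketch below is an illustration under stated assumptions, not part of CLI-Gym: it relies only on the fact that every valid ELF file starts with the four bytes `0x7F 'E' 'L' 'F'`, and the helper names `has_valid_elf_magic` and `find_corrupted_libraries` are hypothetical.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every valid ELF file


def has_valid_elf_magic(path) -> bool:
    """Return True if the file starts with the ELF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(4) == ELF_MAGIC


def find_corrupted_libraries(lib_dir, pattern="*.so*"):
    """List regular files matching `pattern` whose ELF magic is missing."""
    corrupted = []
    for candidate in sorted(Path(lib_dir).glob(pattern)):
        # Skip symlinks so a library and its versioned aliases are not
        # reported more than once; skip anything that is not a regular file.
        if candidate.is_symlink() or not candidate.is_file():
            continue
        if not has_valid_elf_magic(candidate):
            corrupted.append(candidate)
    return corrupted
```

A call such as `find_corrupted_libraries("/usr/lib/x86_64-linux-gnu", "libsqlite3.so*")` would apply the check to one library family; the directory is an assumption about a Debian-style layout. Note that some `.so` files are plain-text linker scripts rather than ELF objects, so a hit from this check is a lead to investigate, not proof of corruption.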