TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, Wenbo Guo

TL;DR

The paper addresses the challenge of training terminal agents with open-weight models by proposing TermiGen, a data-centric pipeline that first generates verifiable Docker-based environments and then collects resilient, error-rich trajectories through a Generator-Critic error-injection loop. The resulting training data contains explicit error–diagnosis–recovery cycles, which mitigates both the hallucinations of purely simulated trajectories and the exposure bias of expert-only trajectories. Fine-tuning open-weight models on TermiGen data sets a new open-weight state of the art on TerminalBench (31.3% pass rate with a 32B model), surpassing o4-mini and approaching the performance of other proprietary systems. The work also reports ablations on the value of verifiability, error-correction training, and negative trajectories, and outlines future directions such as reinforcement learning, memory-enabled agents, and transfer to real-world, large-scale infrastructure.

Abstract

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are neither diverse nor scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weight state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
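
To make the idea of an error-correction cycle concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a scripted list of expert commands, a fake executor, and a typo-based corruption rule; the names `Step`, `fake_execute`, `corrupt`, and `collect_trajectory` are all hypothetical. It does not model the separate Generator and Critic roles or a real Docker environment; it only shows how an injected mistake, its error output, and the subsequent recovery can be recorded in one trajectory.

```python
from dataclasses import dataclass


@dataclass
class Step:
    command: str      # command issued at this step
    observation: str  # output returned by the (fake) environment
    kind: str         # "expert", "injected_error", or "recovery"


def fake_execute(command: str) -> str:
    """Hypothetical stand-in for running a command in a container."""
    known_tools = {"ls", "cat", "grep"}
    tool = command.split()[0]
    if tool in known_tools:
        return "ok"
    return f"sh: {tool}: command not found"


def corrupt(command: str) -> str:
    """Assumed error model: swap the first two letters of the tool name."""
    tool, _, rest = command.partition(" ")
    bad_tool = tool[1] + tool[0] + tool[2:] if len(tool) > 1 else tool + "x"
    return (bad_tool + " " + rest).strip()


def collect_trajectory(expert_commands, inject_at):
    """Replay expert commands, injecting a flawed command at chosen steps.

    Each injected error is followed by the correct command, so the
    trajectory contains an explicit error -> observation -> recovery cycle.
    """
    trajectory = []
    for i, cmd in enumerate(expert_commands):
        if i in inject_at:
            bad = corrupt(cmd)
            trajectory.append(Step(bad, fake_execute(bad), "injected_error"))
            trajectory.append(Step(cmd, fake_execute(cmd), "recovery"))
        else:
            trajectory.append(Step(cmd, fake_execute(cmd), "expert"))
    return trajectory


if __name__ == "__main__":
    for step in collect_trajectory(["ls -la", "grep foo log.txt"], inject_at={1}):
        print(f"[{step.kind}] $ {step.command} -> {step.observation}")
```

In this toy version the injection points are fixed by index so the result is reproducible; a real pipeline would decide where and how to inject errors by other means.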

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Average pass rate on TerminalBench (%) vs. model size. TermiGen models (bold) outperform other open-source baselines and approach proprietary system performance at the 32B scale. Blue denotes proprietary models, green denotes general-purpose base models, and red denotes fine-tuned models.
  • Figure 2: The overall pipeline of TermiGen. Phase I generates diverse, functionally valid tasks within Docker containers via iterative refinement (see the environment synthesis subsection). Phase II synthesizes robust expert trajectories by actively injecting errors into the execution flow, enabling the model to learn error diagnosis and recovery (see the trajectory collection subsection).
  • Figure 3: Distribution of $420$ command-line tools across $16$ functional categories. Bubble sizes are proportional to the logarithm of the number of tools in each category.
  • Figure 4: t-SNE visualization of $\approx 3{,}500$ tasks across $11$ categories, showing semantic clustering by task type.