On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

TL;DR

A systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior.

Abstract

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.

Paper Structure (45 sections, 20 figures, 10 tables)

Figures (20)

  • Figure 1: Overview of Terminal-Task-Gen. Our framework combines Dataset Adaptation, which transforms existing benchmarks into terminal prompts, with Synthetic Task Generation, which uses seed data and a Skill Taxonomy to construct targeted scenarios. Tasks from both streams feed the Trajectory Generation phase, where agents interact with Dockerized environments to produce solution traces, followed by Post-Processing (decontamination and filtering) to yield the final SFT dataset.
  • Figure 2: Terminal-Bench task directory structure. Each task consists of an instruction prompt, task metadata, environment files, Dockerfile, reference solution, and test cases.
  • Figure 3: Terminus 2 agent response format. The Terminus 2 agent scaffold prompts the model to output responses in a JSON format, which includes: analysis, plan, commands, and task_complete.
  • Figure 4: Impact of training data scale on model performance. Our scaling experiments show that TB2.0 performance increases with training data volume for both Qwen3-8B and Qwen3-14B.
  • Figure 5: Distribution of the number of tokens in the generated trajectories.
  • ...and 15 more figures
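
The Terminus 2 response format described in the Figure 3 caption can be illustrated with a small validation sketch. The four top-level keys (analysis, plan, commands, task_complete) come from the caption; everything else here is an assumption made for illustration, including the helper name parse_agent_response, the AgentResponse container, the choice to treat task_complete as optional, and the "keystrokes" field inside each command, whose actual shape is not specified in this section.

```python
import json
from dataclasses import dataclass

# Keys named in the Figure 3 caption. Treating task_complete as optional
# (defaulting to False) is an assumption of this sketch.
REQUIRED_KEYS = ("analysis", "plan", "commands")


@dataclass
class AgentResponse:
    """Hypothetical container for one parsed agent turn."""
    analysis: str
    plan: str
    commands: list
    task_complete: bool = False


def parse_agent_response(raw: str) -> AgentResponse:
    """Parse and minimally validate a JSON-formatted agent response."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["commands"], list):
        raise ValueError("commands must be a list")
    return AgentResponse(
        analysis=str(data["analysis"]),
        plan=str(data["plan"]),
        commands=data["commands"],  # command items are left opaque here
        task_complete=bool(data.get("task_complete", False)),
    )


# Illustrative response; the "keystrokes" field is an assumed command shape.
example = (
    '{"analysis": "Listing files", "plan": "Inspect the directory", '
    '"commands": [{"keystrokes": "ls -la\\n"}], "task_complete": false}'
)
response = parse_agent_response(example)
print(response.plan, len(response.commands), response.task_complete)
```

A check of this kind could plausibly serve as one filtering step over generated trajectories, rejecting turns whose output does not parse, though this section does not say whether the authors filter in this way.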