Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

Yiqing Xie; Emmy Liu; Gaokai Zhang; Nachiket Kotalwar; Shubham Gandhi; Sathwik Acharya; Xingyao Wang; Carolyn Rose; Graham Neubig; Daniel Fried

TL;DR

Hybrid-Gym is a training environment consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Agents trained on these synthetic tasks generalize effectively to diverse real-world tasks that are not present in training.

Abstract

When assessing the quality of coding agents, predominant benchmarks, such as SWE-Bench, focus on solving single issues on GitHub. In real use, however, these agents solve more varied and complex tasks that involve other skills, such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and we derive a set of principles for designing auxiliary training tasks that teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks generalize effectively to diverse real-world tasks that are not present in training, improving a base model by 25.4% absolute on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code is available at: https://github.com/yiqingxyq/Hybrid-Gym.

Paper Structure (29 sections, 6 figures, 7 tables)


Introduction
Hybrid-Gym: Constructing Scalable Multitask Coding Agent Training Data
Analysis of Coding Agent Task Transferability
Principles for Training Task Design
Hybrid-Gym Tasks
Verification of Hybrid-Gym Tasks
Experimental Results
Experimental Setup
Main Results
Detailed Results
Analysis: on the Task Transferability of Coding Agent Training
Output Format Must Match Downstream Tasks
Script-level Agentic Tasks Do NOT Generalize to Repo-level Tasks
Task and Trajectory Complexity Matters
Even Successful Trajectories of the Same Task Instances may have Different Impact
...and 14 more sections

Figures (6)

Figure 1: (Left) We decompose general coding agent tasks into a set of intermediate components and compute the percentage of agent actions spent on each component. Our training tasks partially cover verification and fully cover reasoning, repository exploration, and implementation, which together account for around 68% of the actions. (Right) Example actions for each component. Compared to a baseline training method, SWE-Gym, training with our Hybrid-Gym significantly reduces failures due to insufficient reasoning, insufficient exploration, and failed file editing, improving the final resolved rate on SWE-Bench Verified from 20.6% to 32.4%.
Figure 2: Scaling law analysis. Performance on SWE-bench Verified improves consistently as training data size increases from around 5% (250 trajectories) to 100% (4.4k trajectories).
Figure 3: Statistics of Hybrid-Gym and its subsets. Following SWE-Smith, we report the average per-instance cost of setting up the training environment for rollout. Compared to existing datasets, Hybrid-Gym covers more repositories and requires only 2 docker images to build all training instances.
Figure 3: Controlled experiments on training data characteristics. (a) Output Format: Removing the file editing actions (str_replace) from function localization trajectories causes a large drop in SWE-bench resolution rate. (b) Repo-Exploration: Script-level code generation (LCB) does not effectively transfer to repo-level issue-solving and even underperforms documentation generation, a simple repo-level task. (c) Task Complexity: Transfer improves as training tasks become more complex. (d) Trajectory Complexity: With a fixed data size, training on longer trajectories (more agent steps) substantially improves downstream performance.
Figure 4: Effect of teacher model and data selection. (a) Effect of sampling strategy at a fixed training budget. Repository diversity improves training, but using the same repositories as in evaluation does not. (b) Training on issue localization trajectories with different teacher models. o3-mini (edited) indicates the same set of trajectories as o3-mini, but with text and action steps combined.
...and 1 more figures
