Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, Daniel Fried

TL;DR

Hybrid-Gym is a training environment consisting of scalable synthetic tasks, such as function localization and dependency search. Agents trained on these synthetic tasks generalize effectively to diverse real-world tasks that are not present in training.

Abstract

When assessing the quality of coding agents, predominant benchmarks, such as SWE-Bench, focus on solving single issues on GitHub. In contrast, in real use, these agents solve more varied and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by an absolute 25.4% on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code is available at: https://github.com/yiqingxyq/Hybrid-Gym.
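The abstract reports its improvements as absolute gains, that is, differences in percentage points rather than ratios. The short sketch below illustrates the distinction using the only before/after pair quoted on this page, the 20.6% to 32.4% resolved rate given in the Figure 1 caption. The function names are hypothetical and chosen only for this illustration; they are not taken from the Hybrid-Gym codebase.

```python
def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Improvement expressed as a percentage of the starting rate."""
    if before_pct == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (after_pct - before_pct) / before_pct


# Resolved rates on SWE-Bench Verified quoted in the Figure 1 caption.
before, after = 20.6, 32.4

print(round(absolute_gain(before, after), 1))  # 11.8 percentage points
print(round(relative_gain(before, after), 1))  # 57.3 percent of the baseline
```

The same change therefore reads as 11.8 points in absolute terms and roughly 57% in relative terms, which is why the abstract states which convention it uses.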

Paper Structure (29 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: (Left) We decompose general coding agent tasks into a set of intermediate components and compute the percentage of agent actions spent on each component. Our training tasks partially cover verification and fully cover reasoning, repository exploration, and implementation, which account for around 68% of the actions. (Right) Example actions for each component. Compared to a baseline training method, SWE-Gym, training with our Hybrid-Gym significantly reduces failures due to insufficient reasoning, insufficient exploration, and failed file editing, improving the final resolved rate on SWE-Bench Verified from 20.6% to 32.4%.
  • Figure 2: Scaling law analysis. Performance on SWE-bench Verified improves consistently as training data size increases from around 5% (250 trajectories) to 100% (4.4k trajectories).
  • Figure 3: Statistics of Hybrid-Gym and its subsets. Following SWE-Smith, we report the average per-instance cost of setting up the training environment for rollout. Compared to existing datasets, Hybrid-Gym covers more repositories and requires only 2 docker images to build all training instances.
  • Figure 3: Controlled experiments on training data characteristics. (a) Output Format: Removing the file editing actions (str_replace) from function localization trajectories causes a large drop in SWE-bench resolution rate. (b) Repo-Exploration: Script-level code generation (LCB) does not effectively transfer to repo-level issue-solving and even underperforms documentation generation, a simple repo-level task. (c) Task Complexity: Transfer improves as training tasks become more complex. (d) Trajectory Complexity: With a fixed data size, training on longer (more agent steps) trajectories substantially improves downstream performance.
  • Figure 4: Effect of teacher model and data selection. (a) Effect of sampling strategy at a fixed training budget. Repository diversity improves training, but using the same repositories as in evaluation does not. (b) Training on issue localization trajectories with different teacher models. o3-mini (edited) indicates the same set of trajectories as o3-mini, but with text and action steps combined.
  • ...and 1 more figure
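The Figure 1 caption describes computing the percentage of agent actions spent on each component. A minimal sketch of that kind of tally is shown below, assuming each action in a trajectory has already been assigned one component label. The labels in `toy_labels` are made up for illustration and are not data from the paper, and `component_shares` is a hypothetical helper, not part of the Hybrid-Gym code.

```python
from collections import Counter


def component_shares(action_labels):
    """Return the percentage of actions assigned to each component label."""
    counts = Counter(action_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical labels for one short trajectory (NOT data from the paper).
toy_labels = [
    "reasoning",
    "repository exploration",
    "repository exploration",
    "implementation",
    "verification",
]

shares = component_shares(toy_labels)
for name, pct in sorted(shares.items()):
    print(f"{name}: {pct:.1f}%")
```

Summing the shares of the components that a set of training tasks covers gives a coverage figure of the same kind as the "around 68%" quoted in the caption.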