Simulating Environments with Reasoning Models for Agent Training

Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, Saravan Rajmohan

TL;DR

The paper tackles the brittleness of LLM agents in broad, dynamic contexts by introducing two environment-agnostic training frameworks: Simia-SFT (trajectory synthesis) and Simia-RL (RL with LLM-simulated feedback). Simia-SFT uses a four-stage trajectory synthesis pipeline (pre-filtering, prompt design, LLM simulation, post-processing) to produce diverse, training-ready data without real testbeds, and Simia-RL runs RL in simulated environments to further refine policies. Across benchmarks such as $\tau^2$-Bench, OfficeBench, and AgentBench, open models fine-tuned on simulated trajectories achieve substantial gains, with some results surpassing GPT-4o and approaching larger baselines, while RL on simulated environments yields additional improvements. The work presents a scalable, transferable pathway for agent training that replaces heavy environment engineering with flexible LLM-based simulation, enabling broader progress in real-world task handling and tool use.

Abstract

LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.

Paper Structure

This paper contains 38 sections, 18 figures, and 5 tables.

Figures (18)

  • Figure 1: Performance of models fine-tuned on our synthetic simulated trajectories without real environment implementations. Our 32B model (based on Qwen2.5-32B-Instruct) surpasses GPT-4o and the xLAM-2-70B model, and our 8B model (based on Qwen3-8B) outperforms Qwen2.5-32B-Instruct on $\tau^2$-Airline and Retail.
  • Figure 2: An LLM can reason to simulate plausible environment feedback without requiring access to all actual testbed data or system information.
  • Figure 3: Simia-SFT pipeline to synthesize agent trajectory data without real environment executions. The diagram shows the flow from seed trajectory through pre-filtering, prompt design, LLM simulation, and a final sanity check.
  • Figure 4: Simia-RL framework, which enables RL through multi-turn interactions within simulated environments. An LLM-based simulator provides both environment feedback and reward signals to support iterative policy optimization.
  • Figure 5: Pass$^k$ performance comparison on $\tau^2$-Bench across the Airline and Retail domains for Simia-Tau and xLAM-2 models, with $k$ values of 1, 2, and 3. Pass$^k$ requires each task to succeed on all $k$ retries, highlighting robustness.
  • ...and 13 more figures
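
The metric in the Figure 5 caption counts a task as solved only if all $k$ retries succeed. The sketch below is one way to compute such a score from per-task trial outcomes; it is an illustration, not the paper's evaluation code. The function name `pass_hat_k`, the input layout, and the toy data are hypothetical. It assumes the combinatorial estimator $\binom{c}{k}/\binom{n}{k}$ over $n$ recorded trials with $c$ successes, which reduces to "all $k$ retries succeeded" when $n = k$.

```python
from math import comb


def pass_hat_k(trials_per_task, k):
    """Average over tasks of the chance that k trials, drawn without
    replacement from the recorded trials, are all successful.

    trials_per_task: list of lists of booleans (one inner list per task).
    k: number of retries that must all succeed.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not trials_per_task:
        raise ValueError("need at least one task")

    scores = []
    for outcomes in trials_per_task:
        n = len(outcomes)
        if n < k:
            raise ValueError("each task needs at least k recorded trials")
        c = sum(1 for ok in outcomes if ok)  # number of successful trials
        # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): 4 tasks, 3 trials each.
results = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
]

for k in (1, 2, 3):
    print(k, pass_hat_k(results, k))
```

With this toy data the score cannot increase as $k$ grows, because a single failed retry is enough to disqualify a task at $k = 3$. This is why the caption describes the metric as highlighting robustness rather than best-case success.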