Simulating Environments with Reasoning Models for Agent Training

Yuetai Li; Huseyin A Inan; Xiang Yue; Wei-Ning Chen; Lukas Wutschitz; Janardhan Kulkarni; Radha Poovendran; Robert Sim; Saravan Rajmohan

Simulating Environments with Reasoning Models for Agent Training

Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, Saravan Rajmohan

TL;DR

The paper tackles the brittleness of LLM agents in broad, dynamic contexts by introducing environment-agnostic training via Simia-SFT (trajectory synthesis) and Simia-RL (RL with LLM-simulated feedback). It leverages a four-stage trajectory synthesis pipeline (pre-filtering, prompt design, LLM-simulation, post-processing) to produce diverse, training-ready data without real testbeds, and demonstrates RL with simulated environments to further refine policies. Across benchmarks such as the $\tau^2$-Bench, OfficeBench, and AgentBench, open models fine-tuned on simulated trajectories achieve substantial gains, with some results surpassing GPT-4o and approaching larger baselines, while RL on simulated environments yields additional improvements. The work presents a scalable, transferable pathway for agent training that replaces heavy environment engineering with flexible LLM-based simulation, enabling broader progress in real-world task handling and tool use.

Abstract

LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $τ^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.

Simulating Environments with Reasoning Models for Agent Training

TL;DR

Abstract

Simulating Environments with Reasoning Models for Agent Training

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (18)