Don't Just Fine-tune the Agent, Tune the Environment

Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, Tao Lin

TL;DR

The paper tackles data scarcity in training complex, multi-turn tool-using LLM agents. It introduces Environment Tuning, combining a structured curriculum, actionable environment augmentation, and fine-grained progress rewards to enable learning directly from problem instances without expert trajectories. Experiments on BFCL with only 400 training samples show significant in-distribution gains across base models and improved out-of-distribution generalization compared to SFT baselines. The approach shifts from trajectory imitation to environment-driven exploration, suggesting a practical path toward robust, data-efficient agent training.

Abstract

Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce Environment Tuning, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. Environment Tuning orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
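
To make the contrast between a sparse binary reward and a fine-grained progress reward concrete, here is a minimal sketch. The abstract does not give the reward formula, so the "fraction of turns passed" form below is an assumption for illustration, and the names `binary_reward`, `progress_reward`, and `turn_passed` are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def binary_reward(turn_passed: Sequence[bool]) -> float:
    """Sparse signal: 1.0 only if every turn of the episode succeeded."""
    return 1.0 if turn_passed and all(turn_passed) else 0.0


def progress_reward(turn_passed: Sequence[bool]) -> float:
    """Dense signal (assumed form): fraction of turns that succeeded."""
    if not turn_passed:
        return 0.0
    return sum(turn_passed) / len(turn_passed)


# A four-turn episode where only the third turn failed.
episode = [True, True, False, True]
print(binary_reward(episode))    # 0.0
print(progress_reward(episode))  # 0.75
```

Under the binary reward this episode is indistinguishable from one that failed every turn, whereas the dense variant still credits the three turns that passed. This is one plausible reading of why dense signals help in multi-turn settings.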

Paper Structure

This paper contains 48 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Limitations of Existing Paradigms and the Environment Tuning Advantage. This figure contrasts three agent training approaches on a travel planning task. (Left) Supervised Fine-Tuning (SFT) on static trajectories struggles with generalization. (Center) Reinforcement Learning (RL) in a traditional environment provides only sparse, uninformative feedback. (Right) Our approach uses an augmented environment that provides actionable, fine-grained feedback upon failure.
  • Figure 2: An illustration of multi-turn tool-use scenarios, adapted from an official example in the BFCL V3 blog [patil2025bfclv3blog]. All three tracks start from the same initial user request. The Base multi-turn track (center) shows a successful execution path. The Missing parameter track (top) illustrates a scenario where the agent must handle ambiguity by asking for clarification. The Missing function track (bottom) shows a case where the agent needs to recognize that a required tool is unavailable. These scenarios highlight the diverse reasoning capabilities our curriculum is designed to address.
  • Figure 3: An overview of Environment Tuning. Our core innovation is the Environment Tuning module, which implements a four-stage curriculum. It dynamically configures the reward function, environment feedback (Standard vs. Augmented), and data split for the Agent Learning Loop. This staged approach transforms ambiguous errors into actionable lessons (highlighted in the Environment Augmentation in Action panel), enabling efficient and stable learning from limited data.
  • Figure 4: Training dynamics comparison for Environment Tuning on Qwen2.5-7B-Instruct. (a) The effect of Actionable Environment Augmentation on learning stability and performance across different data splits. (b) The impact of fine-grained Progress Reward versus binary reward on training effectiveness, showing the critical role of dense reward signals in complex multi-turn scenarios.
  • Figure 5: Training dynamics of Environment Tuning on the Qwen2.5-7B-Instruct model using the BFCL V3 dataset. A held-out set of 100 samples from the remaining BFCL data is used for validation. (a) In Stage 1, the agent rapidly masters syntactic correctness, shown by the steep rise in format and tool call rewards and the drop in interaction rounds. (b) Across the full four-stage curriculum, the agent demonstrates both steady performance improvement on the validation set and stable gradient norms, showcasing the effectiveness and stability of our staged learning approach.
  • ...and 2 more figures
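
The captions for Figures 1 and 3 describe a four-stage curriculum that configures the reward function, the feedback mode (Standard vs. Augmented), and the data split, together with an augmented environment that turns ambiguous errors into actionable hints. The sketch below illustrates that idea only. `StageConfig`, `CURRICULUM`, `environment_feedback`, every field value, and the message wording are hypothetical placeholders, and the captions do not say which stage uses which setting.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StageConfig:
    """Per-stage settings; all values used below are illustrative placeholders."""
    name: str
    reward: str      # which reward function is active in this stage
    feedback: str    # "standard" or "augmented"
    data_split: str  # which subset of the training problems is used


# Four stages, as in the Figure 3 caption. The specific assignments are assumptions.
CURRICULUM: List[StageConfig] = [
    StageConfig("stage_1", reward="format_and_tool_call", feedback="standard", data_split="split_a"),
    StageConfig("stage_2", reward="progress", feedback="augmented", data_split="split_a"),
    StageConfig("stage_3", reward="progress", feedback="augmented", data_split="split_b"),
    StageConfig("stage_4", reward="progress", feedback="standard", data_split="split_b"),
]


def environment_feedback(tool: str, args: Dict[str, object],
                         required: List[str], mode: str) -> str:
    """Return the environment's response to a tool call.

    In "standard" mode a failure yields a terse, uninformative error.
    In "augmented" mode the error names what is missing and suggests a next step.
    """
    missing = [p for p in required if p not in args]
    if not missing:
        return "ok"
    if mode == "standard":
        return "Error: invalid call"
    return (
        f"Error: call to '{tool}' is missing required parameter(s): "
        f"{', '.join(missing)}. Ask the user for the value(s) or retry with them filled in."
    )


call_args = {"origin": "SFO"}
for stage in CURRICULUM[:2]:
    print(stage.name, "->", environment_feedback(
        "book_flight", call_args, ["origin", "destination"], stage.feedback))
```

Keeping the feedback mode as a field of the stage configuration means the same agent loop can run unchanged across stages, with only the environment's behavior varying.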

Theorems & Definitions (1)

  • Remark 3.1: Checkpoint selection for stage transitions