Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song

Abstract

Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail to execute full sequences, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework that addresses both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function-call correctness at increasing granularity) and orchestration (correct tool sequencing that respects dependencies). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm that both reward components are essential: using either alone significantly degrades performance.
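To make the reward design concrete, the following is a minimal sketch of how a graduated reward of this shape could be computed over a predicted call sequence. The `ToolCall` container, the validator callables, the 0.5/0.5 credit split inside atomic validity, and the equal weighting of the two components are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a graduated reward for multi-step tool orchestration.
# All names, the credit split, and the equal weighting are illustrative
# assumptions, not the paper's exact formulation.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str                                       # API/function name
    args: dict                                      # parameter name -> value
    depends_on: list = field(default_factory=list)  # indices of earlier calls
                                                    # whose outputs this consumes

def atomic_reward(calls, validate_syntax, validate_semantics):
    """Average per-call validity at increasing granularity: half credit for a
    well-formed call (AST level), full credit only if its parameter values
    also pass execution-time (semantic) checks."""
    if not calls:
        return 0.0
    score = 0.0
    for call in calls:
        if validate_syntax(call):          # static: known tool, well-formed args
            score += 0.5
            if validate_semantics(call):   # dynamic: values the API accepts
                score += 0.5
    return score / len(calls)

def orchestration_reward(calls, reference_order):
    """1.0 iff the predicted tool sequence matches the reference ordering and
    every call appears strictly after all calls it depends on."""
    if [c.name for c in calls] != reference_order:
        return 0.0
    for i, call in enumerate(calls):
        if any(dep >= i for dep in call.depends_on):
            return 0.0
    return 1.0

def graduated_reward(calls, reference_order, validate_syntax, validate_semantics):
    # Combined signal: R = R_atomic + R_orch (equal weighting assumed here).
    return (atomic_reward(calls, validate_syntax, validate_semantics)
            + orchestration_reward(calls, reference_order))
```

The key property is that a trajectory with correct sequencing but flawed parameter values (or vice versa) still earns partial credit, which is exactly the signal a binary reward withholds.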

Figures (7)

  • Figure 1: Framework overview using hotel booking as a running example. Phase 1: Collect real API responses and curate workflow templates with dependency structures. Phase 2: Sample cache entries following dependencies, then generate queries matching the sampled parameters. Phase 3: The LLM interacts with the deterministic environment through multi-turn rollouts, receiving graduated rewards ($R_{\text{atomic}}$ + $R_{\text{orch}}$) for GRPO updates.
  • Figure 2: Training dynamics under different reward configurations. Left: $R_{\text{atomic}}$-only training achieves high atomic validity, but orchestration collapses. Middle: $R_{\text{orch}}$-only training achieves perfect orchestration, but atomic validity drops. Right: the combined reward achieves balanced improvement.
  • Figure 3: Breakdown of $R_{\text{atomic}}$ into AST validation (static) and semantic validation (execution); a sketch of this two-stage validation follows after this figure list. $R_{\text{orch}}$-only training maintains $R_{\text{AST}}$, but $R_{\text{sem}}$ collapses, indicating syntactically valid but semantically broken calls.
  • Figure 4: Turn Accuracy (%) stratified by (a) dependency depth and (b) dependency pattern. Performance degrades with depth and for fan-out patterns. RL training provides consistent improvements across all stratifications.
  • Figure 5: Case study: parallel conjunction requiring both hotel and attraction searches. The baseline completes only one branch; the RL model learns to execute both.
  • ...and 2 more figures
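To make the Figure 3 decomposition concrete, here is a minimal sketch of the two validation stages behind $R_{\text{atomic}}$, under assumed conventions: tool calls serialized as JSON objects, a per-tool schema dict with a `required` field, and a response cache keyed by (tool name, canonicalized arguments). None of these are the paper's actual interfaces; functions like these could serve as the `validate_syntax` / `validate_semantics` callables in the earlier reward sketch.

```python
# Hypothetical two-stage validation behind R_atomic (cf. Figure 3).
# R_AST: static check that the call parses and matches the tool's schema.
# R_sem: dynamic check that the call, with its concrete parameter values,
#        resolves against the cache of real API responses.
# The schema format and cache keying are illustrative assumptions.
import json

def ast_valid(raw_call: str, schemas: dict) -> bool:
    """Static (AST-level) validation: well-formed JSON, a known tool name,
    and all required parameters present."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    schema = schemas.get(call.get("name"))
    if schema is None:
        return False
    return all(p in call.get("args", {}) for p in schema["required"])

def semantically_valid(raw_call: str, response_cache: dict) -> bool:
    """Semantic (execution-level) validation: the exact call hits an entry in
    the cached real API responses. A call can pass ast_valid yet fail here,
    e.g. when a parameter value was hallucinated rather than propagated from
    an earlier call's output. Assumes raw_call already passed ast_valid."""
    call = json.loads(raw_call)
    key = (call["name"], json.dumps(call.get("args", {}), sort_keys=True))
    return key in response_cache
```

This separation mirrors the failure mode in Figure 3: training on $R_{\text{orch}}$ alone keeps calls parseable ($R_{\text{AST}}$ stays high) while their parameter values drift ($R_{\text{sem}}$ collapses).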