In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu

TL;DR

This work tackles long-horizon, tool-augmented reasoning by introducing AgentFlow, a trainable in-the-flow agentic system that coordinates four specialized modules through an evolving memory. It proposes Flow-GRPO, an on-policy training algorithm that broadcasts a single final-outcome reward to every turn and stabilizes learning with group-normalized advantages, effectively turning multi-turn RL into a series of single-turn updates. Across ten diverse benchmarks with a 7B backbone, AgentFlow substantially outperforms strong baselines and even surpasses GPT-4o on average, while showing improved planning, more reliable tool calling, and gains that scale with larger backbones and longer turn budgets. The results highlight the practical potential of in-the-flow optimization for robust, adaptive tool use in complex reasoning tasks and point to future work on extending training to additional modules and richer reward signals.

Abstract

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
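
The credit-assignment idea described above (one trajectory-level outcome broadcast to every turn, normalized within a group of rollouts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `broadcast_group_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def broadcast_group_advantages(final_rewards, turns_per_rollout, eps=1e-6):
    """Illustrative only: one advantage per turn for each rollout in a group.

    final_rewards     -- one scalar outcome reward per rollout (e.g. 1.0 if the
                         final answer was judged correct, else 0.0)
    turns_per_rollout -- number of planner turns taken in each rollout
    eps               -- assumed small constant to avoid division by zero
    """
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)  # assumption: population std over the group
    advantages = []
    for reward, n_turns in zip(final_rewards, turns_per_rollout):
        # Group-normalized advantage for the whole trajectory.
        a = (reward - mu) / (sigma + eps)
        # Broadcast the same value to every turn of that trajectory.
        advantages.append([a] * n_turns)
    return advantages


# A group of four rollouts for the same query, with different lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]
adv = broadcast_group_advantages(rewards, turns)
```

Each turn can then be treated as its own single-turn training example weighted by its broadcast advantage. When every rollout in a group receives the same reward, all advantages in this sketch are zero, so that group contributes no gradient signal.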

Paper Structure

This paper contains 71 sections, 3 theorems, 18 equations, 12 figures, 3 tables, 1 algorithm.

Key Result

Theorem B.1

In Flow-GRPO, maximizing the global multi-turn objective is mathematically equivalent to maximizing the expected token-level local objective at each time step under the on-policy induced state distribution, given standard sampling assumptions (trajectories sampled i.i.d. from the policy with fixed f

Figures (12)

  • Figure 1: Left: Performance of AgentFlow with a 7B-scale backbone before and after Flow-GRPO tuning across ten diverse reasoning benchmarks. Flow-GRPO substantially improves performance by enhancing planning quality and tool-calling reliability. Right: AgentFlow achieves consistent gains over top baselines, including base LLMs, tool-integrated RL models, and training-free agentic systems. All 7B results use Qwen2.5-7B-Base/Instruct as the backbone and tools.
  • Figure 2: (a) Overview of AgentFlow, a trainable agentic system for in-the-flow planning and tool use. Four modules (planner, executor, verifier, generator) coordinate via a shared evolving memory $M$ and toolset $K$, given a query $q$. The planner policy is optimized on-policy inside the system's multi-turn loop to enable adaptive, long-horizon reasoning. (b) A single state transition, showing the action $a^t$, execution result $e^t$, and verifier signal $v^t$ that update the memory from $M^t$ to $M^{t+1}$.
  • Figure 3: Comparison of two paradigms of LLMs with tool use. (a) Monolithic tool-integrated reasoning models train a single policy to interleave reasoning (e.g., <think>) and tool calls (e.g., <tool_call>) within a single, full-context trajectory. (b) Agentic systems decompose tasks across multiple specialized modules (e.g., planner, coder) that collaborate. These systems are typically training-free, orchestrated by handcrafted logic or prompting.
  • Figure 4: Optimization for our proposed agentic system AgentFlow. Given a query $q$, an evolving memory $M$, and a toolset $K$, the policy model generates actions that target sub-goals and select tools. It is trained via Flow-based Group Refined Policy Optimization (Flow-GRPO), which enables multi-turn reinforcement learning and stable optimization under collaborative dynamics.
  • Figure 5: Tool call ratio change by Flow-GRPO fine-tuning.
  • ...and 7 more figures
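
The multi-turn loop in the Figure 2 caption (planner action $a^t$, execution result $e^t$, verifier signal $v^t$, memory update from $M^t$ to $M^{t+1}$) might be organized roughly as below. This is a hypothetical sketch, not the AgentFlow codebase: the `Memory` class, `run_agent`, the callable signatures, and the toy `add` tool are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical evolving memory M: a growing list of turn records."""
    records: list = field(default_factory=list)

    def update(self, action, result, verdict):
        self.records.append((action, result, verdict))


def run_agent(query, planner, executor, verifier, generator, tools, max_turns=5):
    """Illustrative loop over the four modules; all signatures are assumed."""
    memory = Memory()
    for _ in range(max_turns):
        action = planner(query, memory, tools)          # a^t: sub-goal + tool choice
        result = executor(action, tools)                # e^t: tool output
        done = verifier(query, action, result, memory)  # v^t: continue or stop
        memory.update(action, result, done)             # M^t -> M^{t+1}
        if done:
            break
    return generator(query, memory), memory


# Toy stand-ins for the modules, just to make the loop executable.
tools = {"add": lambda a, b: a + b}

def toy_planner(query, memory, tools):
    return ("add", (2, 3))

def toy_executor(action, tools):
    name, args = action
    return tools[name](*args)

def toy_verifier(query, action, result, memory):
    return result is not None

def toy_generator(query, memory):
    return str(memory.records[-1][1])

answer, memory = run_agent("What is 2 + 3?", toy_planner, toy_executor,
                           toy_verifier, toy_generator, tools)
```

In this sketch only the planner would be a trained policy; the other three callables stand in for modules that, per the abstract, are not directly optimized.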

Theorems & Definitions (6)

  • Definition B.1: Core Components
  • Definition B.2: Objective Functions
  • Theorem B.1
  • proof
  • Lemma B.2: Policy Performance Difference
  • Theorem B.3: Monotonic Improvement Guarantee