FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning

Yihao Liu, Ziyun Zhang, Zile He, Huaqian Cai

TL;DR

FlowMind tackles the challenge of translating free-form LLM reasoning and tool use into reliable, structured workflows by decoupling task execution from workflow construction. The Execute-Summarize (ES) framework first runs an execution phase that completes the task with domain tools, then a summarization phase that reconstructs a workflow graph from verified execution traces. FlowBench provides a synthetic benchmark that evaluates both task solving and workflow induction, and extensive experiments show that ES-based variants, especially ES-P&E, outperform one-stage baselines in both correctness and efficiency. The work demonstrates robustness across model scales, offers insights into cognitive burden, and highlights practical benefits in interpretability, reproducibility, and downstream automation, while acknowledging the limitations of synthetic data and summarization fidelity.

Abstract

LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool use and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows during execution often suffer from inaccuracies due to interference between the two processes. We propose an Execute-Summarize (ES) framework that decouples task execution from workflow construction: the model first completes the task using available tools, then independently reconstructs a structured workflow from execution traces. This separation improves workflow accuracy and robustness. We introduce FlowBench and show through extensive experiments that our approach outperforms existing methods, providing a reliable paradigm for grounding free-form LLM reasoning into structured workflows.
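
The two-phase separation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ToolCall`, `execute`, and `summarize` are hypothetical, a fixed list of calls stands in for the LLM's tool-use decisions, and the "workflow" is simplified to an ordered list of the calls that succeeded.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    """One recorded step of the execution trace (hypothetical structure)."""
    tool: str
    args: dict
    result: Any
    ok: bool


def execute(plan: list[tuple[str, dict]],
            tools: dict[str, Callable[..., Any]]) -> list[ToolCall]:
    """Execute phase: run each tool call and record what happened.

    Assumption: `plan` stands in for the calls an LLM would choose at run time.
    """
    trace = []
    for name, args in plan:
        try:
            trace.append(ToolCall(name, args, tools[name](**args), True))
        except Exception as exc:  # a failed call stays in the trace, marked not ok
            trace.append(ToolCall(name, args, repr(exc), False))
    return trace


def summarize(trace: list[ToolCall]) -> list[dict]:
    """Summarize phase: rebuild an ordered workflow from successful calls only."""
    successful = (call for call in trace if call.ok)
    return [
        {"step": i, "tool": call.tool, "args": call.args}
        for i, call in enumerate(successful, start=1)
    ]


# Toy domain tools and a toy plan, both invented for this illustration.
tools = {"add": lambda a, b: a + b, "div": lambda a, b: a / b}
plan = [
    ("add", {"a": 1, "b": 2}),
    ("div", {"a": 1, "b": 0}),  # raises ZeroDivisionError during execution
    ("div", {"a": 6, "b": 3}),
]

trace = execute(plan, tools)
workflow = summarize(trace)
```

Because `summarize` reads only the finished trace, the failed call never reaches the workflow, and the execution loop carries no workflow-construction logic. This mirrors, in a very reduced form, the decoupling the abstract argues for.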

Paper Structure (112 sections, 9 equations, 9 figures, 30 tables)

Figures (9)

  • Figure 1: Overview of the Execute-Summarize design in FlowMind. The Execute phase uses domain tools to generate task trajectories from user queries, while the Summarize phase converts these trajectories into a structured workflow via workflow tools.
  • Figure 2: Overview of our four-stage dataset construction pipeline. The process synthesizes domain tools, validates tool quality, generates tool-dependent tasks, and filters consistent and high-quality instances to form the final dataset. See the dataset construction section for details.
  • Figure 3: Pass rates of ReAct variants on Qwen3-8B: vanilla ReAct (4.3%), Enhanced ReAct (15.1%), and ES-ReAct (20.4%).
  • Figure 4: Failure rates on Qwen3-8B across different MTI patterns: interleaved execution and graph construction, interrupted execution, and front-loaded graph construction. Interrupted execution causes 100% failure for both ReAct and P&E, highlighting severe tool-level interference.
  • Figure 5: Impact of execution outcomes and trajectory quality on summarization performance on Qwen3-8B. The left panel shows pass rates conditioned on execution success versus failure, while the right panel compares complete and error-free trajectories with incomplete or erroneous ones, for both ES-ReAct and ES-P&E.
  • ...and 4 more figures