GNNs as Predictors of Agentic Workflow Performances
Yuanshuo Zhang, Yuchen Hou, Bohan Tang, Shuo Chen, Muhan Zhang, Xiaowen Dong, Siheng Chen
TL;DR
This work tackles the high computational cost of evaluating LLM-driven agentic workflows by introducing FLOW-GNN, a graph neural predictor that operates on workflows modeled as DAGs to forecast performance without executing the full system. It also presents FLORA-Bench, a large-scale benchmark with 600k workflow-task pairs across coding, mathematics, and reasoning, enabling rigorous evaluation using accuracy and utility. In-domain results show GNNs achieve strong predictive performance (average accuracy ≈ 0.78, utility ≈ 0.72) and robustness across several LLMs, while cross-domain generalization remains challenging. Crucially, using FLOW-GNN as a predictor accelerates optimization cycles by about 125x with modest performance loss, illustrating a practical path toward prediction-driven agentic workflow optimization and expanding the utility of GNNs in complex, multi-agent AI systems.
Abstract
Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. However, optimizing such workflows is costly and inefficient in real-world applications because every evaluation requires extensive LLM invocations. To address this cost, this position paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances, avoiding repeated LLM invocations for evaluation. To empirically ground this position, we construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances. With extensive experiments, we arrive at the following conclusion: GNNs are simple yet effective predictors. This conclusion supports new applications of GNNs and a novel direction towards automating agentic workflow optimization. All code, models, and data are available at https://github.com/youngsoul0731/Flora-Bench.
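To make the formulation above concrete, the following is a minimal sketch of how a workflow might be treated as a directed graph and scored by a GNN-style predictor without calling any LLM. It is not the FLOW-GNN architecture or the FLORA-Bench code: the agent names, the two-dimensional node features, the hand-set weights, and the helper names `message_pass` and `predict_success` are all hypothetical, and a real predictor would learn its parameters from labeled workflow-task pairs.

```python
import math

def message_pass(features, edges):
    """One round of mean aggregation along directed edges (src -> dst).

    Each node adds the mean feature vector of its predecessors to its own.
    Nodes with no predecessors keep their features unchanged.
    """
    preds = {node: [] for node in features}
    for src, dst in edges:
        preds[dst].append(src)
    updated = {}
    for node, feat in features.items():
        if preds[node]:
            agg = [sum(features[p][i] for p in preds[node]) / len(preds[node])
                   for i in range(len(feat))]
        else:
            agg = [0.0] * len(feat)
        updated[node] = [f + a for f, a in zip(feat, agg)]
    return updated

def predict_success(features, edges, weights, bias=0.0, rounds=2):
    """Score a workflow graph: message passing, mean pooling, logistic readout."""
    h = features
    for _ in range(rounds):
        h = message_pass(h, edges)
    dim = len(weights)
    pooled = [sum(v[i] for v in h.values()) / len(h) for i in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical three-agent workflow: planner -> coder -> reviewer.
# Feature vectors stand in for whatever encoding of each agent is used.
features = {"planner": [1.0, 0.0], "coder": [0.0, 1.0], "reviewer": [0.5, 0.5]}
edges = [("planner", "coder"), ("coder", "reviewer")]
weights = [0.3, -0.2]  # untrained, illustrative only

score = predict_success(features, edges, weights)
print(f"predicted success probability: {score:.3f}")
```

The point of the sketch is the cost profile: scoring a candidate workflow is a few arithmetic passes over a small graph, whereas executing it would require an LLM call per agent per task.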
