ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam

TL;DR

ScoreFlow presents a gradient-based framework for automated, adaptive optimization of multi-agent LLM workflows and introduces Score-DPO, a score-aware variant of direct preference optimization. By using code-based workflow representations and an operator library, ScoreFlow achieves robust performance and cost efficiency across six benchmarks in QA, coding, and math, outperforming both manual and prior automated methods by 8.2%. The approach combines quantitative feedback with preference data to accelerate convergence and enable smaller models to surpass larger ones at lower costs. Theoretical analysis supports why score integration improves learning, and extensive ablations demonstrate adaptability across architectures and task types.

Abstract

Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow

Paper Structure

This paper contains 42 sections, 2 theorems, 9 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.2

Let the function $d(x, y): [0, 1]^2 \rightarrow [0, 1]$ be strictly monotonically increasing in $x - y$, and let the function $f(x): [0, 1] \rightarrow [0, 1]$ be strictly monotonically increasing in $x$. Then the per-sample influence of a sample $z$ (Definition 3.1) is strictly monotonically increasing in the score $s_z$ whenever $-(1 - f(s_z))^{-1}\le r_z \le f^{-1}(s_z)$ holds.
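To make the role of $f$ and the scores concrete, the sketch below first computes the standard DPO loss for one preference pair, then shows one hypothetical way to fold scores into it. The function names, the default `beta`, and in particular the scaling of the implicit rewards by `f(s_w)` and `1 - f(s_l)` are assumptions made for illustration. They are not confirmed by this page as the paper's Score-DPO objective.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* are policy log-probabilities, ref_logp_* come from the frozen
    reference model; w = chosen (winner), l = rejected (loser).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def score_scaled_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                          s_w, s_l, beta=0.1, f=lambda s: s):
    """HYPOTHETICAL score-aware variant (an assumption, not the paper's formula).

    Each implicit reward is scaled by a monotone function f of the
    evaluation score in [0, 1]: the chosen reward by f(s_w), the
    rejected reward by 1 - f(s_l).
    """
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of the chosen sample
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of the rejected sample
    return -log_sigmoid(r_w * f(s_w) - r_l * (1.0 - f(s_l)))
```

Under these assumptions, with the identity `f`, a chosen score of 1, and a rejected score of 0, the scaled loss reduces to the standard DPO loss, so the scores only matter when the pair is less clear-cut.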

Figures (8)

  • Figure 1: Pipeline of ScoreFlow. First, for each problem in the dataset, multiple workflows are generated. Next, an executor is employed to execute these workflows for corresponding problems, resulting in evaluation scores. Based on these scores, preference data is collected. Subsequently, incorporating the score information, the Score-DPO algorithm is used to fine-tune the generator. This process is iterated until the maximum number of iterations is reached or convergence is achieved.
  • Figure 2: Illustration of the inference process: Two distinct workflows are generated for two GSM8K problems, and their executed results are evaluated. The executor utilized is GPT-4o-mini, with a temperature of 0. This plot highlights the adaptivity of the generation process.
  • Figure 3: Performance comparison between ScoreFlow and AFlow across various datasets. The y-axis represents the difference in accuracy ($\%$), calculated as the win rate of ScoreFlow minus the win rate of AFlow on the test set. The executor for both methods is GPT-4o-mini. The optimizer LLM (generator) for AFlow is GPT-4o-mini, while the generator for ScoreFlow is Llama-3.1-8B-Instruct. Specifically, ScoreFlow achieves an $88.1\%$ performance on the combined task.
  • Figure 4: API Cost in Inference and Optimization processes. We analyze the API cost during both the inference and optimization processes, comparing different methods across various executors for the HumanEval task. The left figure illustrates the cost during inference on the testing set in relation to Pass@1 performance. The right figure highlights the total cost of optimization for ScoreFlow and AFlow. The generator for our method here is Llama-3.1-8B-Instruct.
  • Figure 5: Solve rate during iteration process.
  • ...and 3 more figures
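The data-collection half of the pipeline in Figure 1 (generate several workflows per problem, execute them to obtain scores, then derive preference pairs from those scores) can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` and `execute` are hypothetical callables standing in for the generator and executor models, `k` is an assumed sample count, and pairing every higher-scored workflow against every lower-scored one is one simple choice, not necessarily the paper's sampling scheme. The fine-tuning step is not shown.

```python
from itertools import combinations
from typing import Callable, List, NamedTuple


class Preference(NamedTuple):
    problem: str
    chosen: str          # workflow with the higher evaluation score
    rejected: str        # workflow with the lower evaluation score
    chosen_score: float
    rejected_score: float


def collect_preferences(
    problems: List[str],
    generate: Callable[[str], str],        # hypothetical: problem -> workflow code
    execute: Callable[[str, str], float],  # hypothetical: (workflow, problem) -> score in [0, 1]
    k: int = 4,                            # assumed number of workflows per problem
) -> List[Preference]:
    """Build score-annotated preference pairs for every problem."""
    pairs: List[Preference] = []
    for problem in problems:
        workflows = [generate(problem) for _ in range(k)]
        scored = [(wf, execute(wf, problem)) for wf in workflows]
        for (wf_a, s_a), (wf_b, s_b) in combinations(scored, 2):
            if s_a == s_b:
                continue  # equal scores carry no preference signal
            if s_a > s_b:
                pairs.append(Preference(problem, wf_a, wf_b, s_a, s_b))
            else:
                pairs.append(Preference(problem, wf_b, wf_a, s_b, s_a))
    return pairs
```

Keeping both scores on each pair, rather than only the winner/loser label, is what would let a score-aware objective use the quantitative feedback during fine-tuning.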

Theorems & Definitions (5)

  • Definition 3.1: per-sample influence
  • Theorem 3.2
  • Lemma A.1
  • proof
  • proof: Proof of Theorem \ref{deriscoreDPO}