ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Xiangyuan Xue; Zeyu Lu; Di Huang; Zidong Wang; Wanli Ouyang; Lei Bai

TL;DR

ComfyBench evaluates LLM-based agents that autonomously design collaborative AI systems within ComfyUI, targeting the gap between monolithic models and integrated agent collaboration. The authors introduce ComfyAgent, a multi-agent framework that uses code-based workflow representations to compose and refine complex pipelines, and evaluate it against strong baselines across 200 tasks. ComfyAgent achieves a resolve rate comparable to o1-preview and outperforms other agents on ComfyBench, but creative tasks remain challenging (only 15% resolved). The work demonstrates both the feasibility and the limitations of autonomous collaborative AI design, offering a benchmark and architecture that can drive progress toward more intelligent, autonomous AI systems.

Abstract

Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work studies the use of LLM-based agents to design collaborative AI systems autonomously. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent is based on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system that cooperates to learn from existing workflows and generate new workflows for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems. Progress with ComfyBench is paving the way for more intelligent and autonomous collaborative AI systems.
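The idea of a code representation that converts reversibly to and from a workflow can be illustrated with a minimal sketch. The graph layout below (node id mapped to a `class_type` and `inputs`, with links written as `[source_id, output_index]`) is a simplified assumption modeled on ComfyUI-style workflow JSON, and the generated code syntax (`n1 = ClassName(...)`) as well as the helper names `to_code` and `from_code` are hypothetical; this is not the paper's actual representation or interpreter. The sketch assumes Python 3.9+ and that every list-valued input is a link.

```python
import ast

def to_code(workflow: dict) -> str:
    """Render each node as one assignment: n<id> = ClassType(arg=value, ...)."""
    lines = []
    for node_id, node in workflow.items():
        args = []
        for name, value in node["inputs"].items():
            if isinstance(value, list):  # assumed link: [source_id, output_index]
                src, idx = value
                args.append(f"{name}=n{src}[{idx}]")
            else:  # plain literal (string, number, ...)
                args.append(f"{name}={value!r}")
        lines.append(f"n{node_id} = {node['class_type']}({', '.join(args)})")
    return "\n".join(lines)

def from_code(code: str) -> dict:
    """Parse the assignments back into the workflow dictionary."""
    workflow = {}
    for stmt in ast.parse(code).body:
        node_id = stmt.targets[0].id[1:]  # strip the "n" prefix
        call = stmt.value
        inputs = {}
        for kw in call.keywords:
            if isinstance(kw.value, ast.Subscript):  # n<src>[idx] is a link
                src = kw.value.value.id[1:]
                inputs[kw.arg] = [src, ast.literal_eval(kw.value.slice)]
            else:
                inputs[kw.arg] = ast.literal_eval(kw.value)
        workflow[node_id] = {"class_type": call.func.id, "inputs": inputs}
    return workflow

# Illustrative three-node graph; node names and values are examples only.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "model.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a photo of a cat", "clip": ["1", 1]}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "steps": 20}},
}

code = to_code(workflow)
print(code)
assert from_code(code) == workflow  # the round trip preserves the graph
```

Because every node becomes a single assignment and every link a subscripted variable, the text form carries the same information as the graph, which is what makes the conversion reversible in this toy setting.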

Paper Structure (26 sections, 12 figures, 10 tables)

Introduction
Related Work
LLM-based Agents
Collaborative AI Systems
ComfyBench
ComfyUI Platform
Benchmark Contents
Evaluation Metrics
Human Evaluation
ComfyAgent
Workflow Representation
Multi-Agent Framework
Experiments
Baseline Agents
Implementation Details
...and 11 more sections

Figures (12)

Figure 1: (a) ComfyBench is a comprehensive benchmark to evaluate agents' ability to design collaborative AI systems in ComfyUI. Given the task instruction, agents are required to learn from documents and create workflows to describe collaborative AI systems. The performance is measured by pass rate and resolve rate, reflecting whether the workflow can be correctly executed and whether the task requirements are realized. (b) ComfyAgent builds collaborative AI systems in ComfyUI by generating workflows. The workflows are converted into equivalent code so that LLMs can better understand them. ComfyAgent can learn from existing workflows and autonomously design new ones. The generated workflows can be interpreted as collaborative AI systems to complete given tasks.
Figure 2: ComfyBench provides annotations for 3,205 nodes and 20 workflows, together with 200 task instructions categorized into three difficulty levels: vanilla, complex, and creative.
Figure 3: The architecture of the ComfyAgent framework. Multiple agents cooperate to design workflows in a step-by-step manner. Given the task instruction, the planner initializes the memory and produces a plan. At each step, the planner updates the plan and forms an action. The different actions (combine, adapt, and retrieve) are then handled by the corresponding agents. After a combination or adaptation, a refine action is performed to ensure correctness. All agents can interact with the memory, which consists of history, reference, and workspace. Once the task is deemed complete, the planner finishes the procedure and saves the workflow.
Figure 4: Examples of four common formats to represent workflows: flow graph, JSON, element list, and code.
Figure 5: A sample question selected from the created questionnaires on Google Forms in the human evaluation.
...and 7 more figures
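
The two metrics named in the Figure 1 caption can be made concrete with a short sketch. It assumes pass rate is the fraction of tasks whose generated workflow executes correctly and resolve rate is the fraction whose output also realizes the task requirements; counting a task as resolved only if it also passed is an assumption here, and `TaskResult`, `pass_rate`, and `resolve_rate` are hypothetical names, not the benchmark's own code. The sample results are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    executed: bool  # the generated workflow ran without error
    resolved: bool  # the output was judged to meet the task requirements

def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose workflow executed correctly."""
    return sum(r.executed for r in results) / len(results)

def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that executed and met the requirements (assumed definition)."""
    return sum(r.executed and r.resolved for r in results) / len(results)

# Illustrative data only, not results from the paper.
results = [
    TaskResult("t1", executed=True, resolved=True),
    TaskResult("t2", executed=True, resolved=False),
    TaskResult("t3", executed=True, resolved=False),
    TaskResult("t4", executed=False, resolved=False),
]

print(f"pass rate: {pass_rate(results):.2f}")
print(f"resolve rate: {resolve_rate(results):.2f}")
```

Under this definition the resolve rate can never exceed the pass rate, which matches the caption's distinction between a workflow that runs and one that actually fulfills the instruction.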
