ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

Rinon Gal; Adi Haviv; Yuval Alaluf; Amit H. Bermano; Daniel Cohen-Or; Gal Chechik

TL;DR

This work introduces the task of prompt-adaptive workflow generation, where the goal is to automatically tailor a text-to-image workflow to each user prompt. It proposes two LLM-based approaches to the task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows.

Abstract

The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.

Paper Structure (19 sections, 7 figures, 2 tables)

Introduction
Related work
Improving text-to-image generation quality
LLM-based tool selection and agents
Workflow generation
Method
ComfyUI
Training Data
ComfyGen-IC
ComfyGen-FT
Implementation details
Experiments
Analysis
Originality and diversity
Analyzing the chosen flows
...and 4 more sections

Figures (7)

Figure 1: The standard text-to-image generation flow (top) employs a single monolithic model to transform a prompt into an image. However, the user community often relies on complex multi-model workflows, hand-crafted by expert users for different scenarios. We leverage an LLM to automatically synthesize such workflows, conditioned on the user's prompt (bottom). By choosing components that better match the prompt, the LLM improves the quality of the generated image.
Figure 2: (a) A simple ComfyUI pipeline using a base model and a face restoration block, as well as both a positive and a negative prompt. (b) Distribution of scores for the (prompt, flow) pairs in our training set. (c) Example images produced for the same prompt by flows with different scores. A higher score typically correlates with more detailed and vibrant results, and fewer artifacts.
Figure 3: Our method can generate higher quality images across diverse domains and styles. Prompts are available in the supplementary.
Figure 4: Qualitative results on GenEval prompts. ComfyGen shows better performance on multi-subject prompts, colorization and attribute binding, but may struggle with positioning.
Figure 5: HPS V2.0 and User Study win rates. We compare each baseline against both ComfyGen-FT (green) and ComfyGen-IC (teal). ComfyGen variants are favored over all baselines.
...and 2 more figures
