CoSTA$^*$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Advait Gupta, NandaKiran Velaga, Dang Nguyen, Tianyi Zhou

TL;DR

We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in both cost and quality, and supports versatile cost-quality trade-offs according to user preference.

Abstract

Text-to-image models like Stable Diffusion and DALL-E 3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimates of the capabilities and costs of tools, which are needed to decide which tool to apply to each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that leverages LLMs to create a subtask tree, uses the tree to prune a graph of AI tools for the given task, and then conducts A* search on the resulting small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in both cost and quality, and supports versatile trade-offs according to user preference.
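The search step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tool names, the cost and quality numbers, the linear blend in `edge_weight`, and the penalty used in `mark_failure` are hypothetical and are not taken from the paper, which defines its own way of combining the two metrics. The sketch only shows how a trade-off coefficient `alpha` can steer A* toward cheaper or higher-quality tool paths, and how a reported failure can redirect the search.

```python
import heapq


def edge_weight(cost, quality, alpha):
    """Blend cost and (1 - quality) into one non-negative weight.

    Assumed linear blend for illustration only; not the paper's formula.
    alpha = 1.0 considers cost only, alpha = 0.0 considers quality only.
    """
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)


def a_star(graph, start, goal, alpha, heuristic=None):
    """A* over graph = {node: [(next_node, cost, quality), ...]}.

    `heuristic` must never overestimate the remaining weight; the default
    zero heuristic satisfies this trivially.
    """
    h = heuristic or (lambda node: 0.0)
    frontier = [(h(start), 0.0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, a better route was already found
        for nxt, cost, quality in graph.get(node, []):
            new_g = g + edge_weight(cost, quality, alpha)
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None, float("inf")


def mark_failure(graph, tool, penalty=1.0):
    """Return a copy of the graph in which edges into `tool` cost more and
    carry zero quality (a hypothetical update rule after a failed check)."""
    return {
        node: [
            (nxt, cost + penalty, 0.0) if nxt == tool else (nxt, cost, quality)
            for nxt, cost, quality in edges
        ]
        for node, edges in graph.items()
    }


# Hypothetical tool graph: each edge carries the cost and quality of the
# tool it leads to.
TOOLS = {
    "input": [("cheap_detector", 0.1, 0.6), ("strong_detector", 0.5, 0.95)],
    "cheap_detector": [("inpaint", 0.4, 0.9)],
    "strong_detector": [("inpaint", 0.4, 0.9)],
    "inpaint": [("output", 0.0, 1.0)],
}
```

Under these made-up numbers, `a_star(TOOLS, "input", "output", alpha=1.0)` routes through `cheap_detector`, while `alpha=0.0` routes through `strong_detector`. Calling `mark_failure(TOOLS, "cheap_detector")` before searching again pushes the cost-only search onto the `strong_detector` path, which mirrors the recovery behavior the abstract describes.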

Paper Structure

This paper contains 61 sections, 7 equations, 13 figures, 11 tables, 2 algorithms.

Figures (13)

  • Figure 1: CoSTA$^*$ with different cost-quality trade-off coefficients $\alpha$ vs. four recent image-editing models/agents. CoSTA$^*$ achieves Pareto optimality and dominates baselines on both metrics.
  • Figure 2: Comparison of CoSTA$^*$ with state-of-the-art image-editing models/agents, which include GenArtist (Wang et al., 2024), MagicBrush (Zhang et al., 2024), InstructPix2Pix (Brooks et al., 2023), and CLOVA (Gao et al., CVPR 2024). The input images and prompts are shown on the left of the figure. The outputs generated by each method illustrate differences in accuracy, visual coherence, and the ability to handle multimodal tasks. Figure \ref{fig:qual_example} shows examples of step-by-step editing using CoSTA$^*$ with intermediate subtask outputs presented.
  • Figure 3: Comparison of CoSTA$^*$ with other planning agents. LLM-only planning is efficient but prone to failure and reliant on heuristics. Search algorithms like A$^*$ guarantee optimal paths but are computationally expensive. CoSTA$^*$ balances cost and quality by first pruning the subtask tree using an LLM, which reduces the graph of tools on which the fine-grained A$^*$ search is conducted.
  • Figure 4: Tool Dependency Graph (TDG). A directed graph where nodes represent tools and edges indicate dependencies. An edge $(v_1, v_2)$ means $v_1$'s output is a legal input of $v_2$. It enables toolpath search for multi-turn image-editing tasks with composite instructions.
  • Figure 5: Three stages in CoSTA$^*$: (1) an LLM generates a subtask tree based on the input and task dependencies; (2) the subtask tree spans a tool subgraph that maintains tool dependencies; and (3) A$^*$ search finds the best toolpath balancing efficiency and quality.
  • ...and 8 more figures
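
The Tool Dependency Graph in Figure 4 and the pruning stage in Figure 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the tool names, the subtask labels, the `TOOL_SUBTASKS` mapping, and the `prune_tdg` helper are all hypothetical, and the subtask tree is simplified to a flat list of subtask names. The only property taken from the captions is that an edge $(v_1, v_2)$ means the output of $v_1$ is a legal input of $v_2$, and that the pruned subgraph keeps the dependencies among the retained tools.

```python
# Hypothetical Tool Dependency Graph: tool -> tools that can consume its output.
TDG = {
    "detector": ["segmenter"],
    "segmenter": ["inpainter", "recolorer"],
    "ocr": ["text_eraser"],
    "text_eraser": ["text_writer"],
    "inpainter": [],
    "recolorer": [],
    "text_writer": [],
}

# Hypothetical mapping from each tool to the subtasks it can take part in.
TOOL_SUBTASKS = {
    "detector": {"object_removal", "recoloring"},
    "segmenter": {"object_removal", "recoloring"},
    "inpainter": {"object_removal"},
    "recolorer": {"recoloring"},
    "ocr": {"text_replacement"},
    "text_eraser": {"text_replacement"},
    "text_writer": {"text_replacement"},
}


def prune_tdg(tdg, tool_subtasks, subtasks):
    """Keep only tools relevant to the requested subtasks, together with the
    dependency edges that connect two retained tools (the induced subgraph)."""
    wanted = set(subtasks)
    keep = {tool for tool, tasks in tool_subtasks.items() if tasks & wanted}
    return {
        tool: [nxt for nxt in successors if nxt in keep]
        for tool, successors in tdg.items()
        if tool in keep
    }
```

For example, `prune_tdg(TDG, TOOL_SUBTASKS, ["object_removal"])` keeps only `detector`, `segmenter`, and `inpainter` with the edges between them, so a subsequent path search has to consider three tools instead of seven.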