Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He

TL;DR

CIELR addresses complex image-editing queries that require multi-step reasoning by decoupling reasoning from editing through a structured semantic representation (SSR) of the image. The framework builds an SSR using foundation models, iteratively refines it with a chain of updates guided by an LLM, and then executes edits with a diffusion model, all in a zero-shot setup that avoids joint fine-tuning. Key contributions include the CIELR architecture, the chain-of-SSR updates for multi-step reasoning, the CIEBench dataset with the IDCS metric, and strong empirical results across three datasets, demonstrating superior semantic correctness and region preservation. This approach improves robustness and practicality for reasoning-based editing in real-world workflows while reducing computational costs associated with training large LLMs and diffusion models together.
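The build, refine, then edit flow described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SSR` dictionary layout, the `Decision` type, and the `toy_reason` and `toy_refine` functions are all hypothetical stand-ins for the LLM and foundation models, and the diffusion-based editing step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical SSR layout: region name -> attribute name -> description.
SSR = Dict[str, Dict[str, str]]


@dataclass
class Decision:
    target: Optional[str]  # region to edit, or None if the SSR is too coarse
    action: str = ""       # simple, explicit edit for the diffusion model
    query: str = ""        # detail the reasoner wants added to the SSR


def chain_of_ssr_updates(
    ssr: SSR,
    instruction: str,
    reason: Callable[[SSR, str], Decision],
    refine: Callable[[SSR, str], SSR],
    max_steps: int = 3,
) -> Tuple[SSR, Decision]:
    """Refine the SSR until the reasoner can name a target region."""
    decision = reason(ssr, instruction)
    for _ in range(max_steps):
        if decision.target is not None:
            break
        ssr = refine(ssr, decision.query)
        decision = reason(ssr, instruction)
    return ssr, decision


# Toy stand-ins for the LLM and the foundation models (not real APIs).
def toy_reason(ssr: SSR, instruction: str) -> Decision:
    for region, attrs in ssr.items():
        if attrs.get("contains") == "caffeine":
            return Decision(target=region, action=f"replace {region} with a glass of water")
    return Decision(target=None, query="contents of each cup")


def toy_refine(ssr: SSR, query: str) -> SSR:
    # Copy the SSR so the earlier version in the chain stays unchanged.
    updated = {region: dict(attrs) for region, attrs in ssr.items()}
    updated["cup_left"]["contains"] = "caffeine"
    updated["cup_right"]["contains"] = "juice"
    return updated


initial_ssr: SSR = {"cup_left": {"color": "white"}, "cup_right": {"color": "white"}}
final_ssr, decision = chain_of_ssr_updates(
    initial_ssr, "Remove the drink that keeps me awake", toy_reason, toy_refine
)
print(decision.target, "->", decision.action)
```

In this sketch the reasoner only ever returns a simple, explicit action, which is the property that lets an off-the-shelf editor be used without joint fine-tuning.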

Abstract

Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at https://github.com/Jia-shao/Reasoning-Editing.
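To put the reported PSNR gain in context, the sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE). It is only an illustration: the paper's exact evaluation protocol (for example, which regions are compared) is not given in this summary, and the function names and the flat-list pixel format are assumptions made here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], edited: Sequence[float], max_value: float = 255.0) -> float:
    """Standard PSNR in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(edited) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no error at all
    return 10.0 * math.log10(max_value ** 2 / mse)


def db_gain_to_mse_ratio(gain_db: float) -> float:
    """How many times smaller the MSE is for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


ratio = db_gain_to_mse_ratio(9.955)
print(f"A 9.955 dB gain corresponds to about {ratio:.1f}x lower MSE")
```

Under this definition, a gain of 9.955 dB means the mean squared error on the compared pixels is roughly 9.9 times smaller.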

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of our proposed CIELR framework. The input is an image and a complex implicit editing instruction, and the output is an edited image.
  • Figure 2: Detailed illustration of constructing the structured semantic representation and the final structured semantic representation in dictionary format.
  • Figure 3: Detailed illustration of the chain of structured semantic representation update process on a sample from CIEBench. The left panel shows the initial structured semantic representation $\mathrm{S}^{(0)}$. The right panel presents the updated structured semantic representation $\mathrm{S}^{(1)}$ with newly added semantic details, enabling the LLM to successfully identify the region requiring modification, resulting in the precisely edited output image (bottom right).
  • Figure 4: Representative samples from our CIEBench dataset showing the three types of reasoning editing tasks.
  • Figure 5: Qualitative comparison of CIELR against baseline methods on implicit reasoning queries from our CIEBench dataset. Blue boxes indicate target regions in the input images, red boxes highlight incorrect or suboptimal edits by baseline methods, and green boxes denote successful edits. The examples of the first two rows show that the CLIP score is not always a good metric to measure the alignment between the edited image and the editing instruction.
  • ...and 1 more figure