ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing

Alec Helbling, Seongmin Lee, Polo Chau

TL;DR

Precise image edits are hard to specify with natural language prompts alone. ClickDiffusion addresses this by combining natural language instructions with direct manipulation, which lets users disambiguate targets and specify exact spatial edits. The system serializes the image layout and the multimodal instruction into text, uses an LLM with in-context and chain-of-thought prompting to produce an edited layout, and renders the result with a layout-based diffusion generator. Key contributions are an LLM-based framework for integrating visual feedback with text instructions, a lightweight five-tool UI for accessible editing, and a few-shot prompting approach that generalizes to unseen transformations without training. The result is fine-grained, interactive editing with concise instructions and precise control over object location and appearance.

Abstract

Recently, researchers have proposed powerful systems for generating and manipulating images using natural language instructions. However, it is difficult to precisely specify many common classes of image transformations with text alone. For example, a user may wish to change the location and breed of a particular dog in an image with several similar dogs. This task is quite difficult with natural language alone, and would require a user to write a laboriously complex prompt that both disambiguates the target dog and describes the destination. We propose ClickDiffusion, a system for precise image manipulation and generation that combines natural language instructions with visual feedback provided by the user through a direct manipulation interface. We demonstrate that by serializing both an image and a multi-modal instruction into a textual representation it is possible to leverage LLMs to perform precise transformations of the layout and appearance of an image. Code available at https://github.com/poloclub/ClickDiffusion.
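
The serialization step described above can be illustrated with a minimal sketch. The text format, the class names (`LayoutObject`, `VisualMarker`), the function `serialize_request`, and the marker symbols `BOX` and `STAR` are all assumptions made for illustration; they are not taken from the ClickDiffusion code, which may use a different representation.

```python
from dataclasses import dataclass


@dataclass
class LayoutObject:
    """One object in the image layout (hypothetical schema)."""
    description: str
    box: tuple  # (x, y, width, height) in pixels


@dataclass
class VisualMarker:
    """A direct-manipulation input, e.g. a drawn box or a placed star."""
    name: str     # symbol the text instruction uses to refer to the marker
    coords: tuple  # 4 values for a box, 2 values for a point


def serialize_request(objects, markers, instruction):
    """Flatten the layout, the markers, and the instruction into one text block."""
    lines = ["Layout:"]
    for i, obj in enumerate(objects):
        lines.append(f"  object {i}: {obj.description} | box={list(obj.box)}")
    lines.append("Markers:")
    for marker in markers:
        lines.append(f"  {marker.name}: {list(marker.coords)}")
    lines.append(f"Instruction: {instruction}")
    return "\n".join(lines)


# Two similar dogs; the drawn box picks out the second one, the star is the destination.
objects = [
    LayoutObject("a brown dog", (40, 200, 120, 90)),
    LayoutObject("a brown dog", (260, 210, 110, 85)),
]
markers = [
    VisualMarker("BOX", (250, 200, 130, 100)),
    VisualMarker("STAR", (420, 80)),
]
text = serialize_request(
    objects, markers, "Move the dog in BOX to STAR and make it a corgi."
)
print(text)
```

Because the markers carry coordinates, the instruction itself can stay short: the text only needs to name `BOX` and `STAR` rather than describe which dog is meant and where it should go.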

Paper Structure

This paper contains 10 sections, 4 figures.

Figures (4)

  • Figure 1: ClickDiffusion is an interactive system that enables users to perform fine-grained image manipulation tasks by seamlessly combining natural language and visual prompts. (A) In our example, a user can use our user interface to select a particular dog with a bounding box and a destination using a star. These locations can be referenced symbolically in a natural language instruction. (B) By serializing the original image's layout and the multi-modal instruction we can leverage an LLM to produce an edited image layout. (C) The edited layout is then fed into a layout-based image generation system to generate an edited image. (D) Our method enables moving objects and allows for much more concise prompts than text-only editing systems like InstructPix2Pix (Brooks et al., 2023).
  • Figure 2: ClickDiffusion enables users to perform precise image manipulations that are difficult to do with text alone. A user can leverage familiar direct manipulation to specify regions or objects in an image, which can be referred to in text instructions. By combining direct manipulation and natural language based editing it becomes much easier for users to perform precise edits like: moving a particular object, adding an object in a specified location, or changing the appearance of an object.
  • Figure 3: Our approach enables a user to disambiguate a particular object from other similar objects, move it, and change its appearance. In contrast, text-only editing approaches like InstructPix2Pix (Brooks et al., 2023) and LLM Grounded Diffusion (Lian et al., 2023) fail to localize the manipulation to the correct object, despite requiring a much longer edit instruction that is harder to write.
  • Figure 4: Our procedure for in-context learning involves placing several examples in the context of our LLM. Each example is composed of an input layout, instruction, a chain of thought, and output layout. These are placed sequentially in the context of the LLM after a preamble prompt.
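
The prompt structure described in Figure 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Example` class, the `build_prompt` function, and the field labels ("Input layout:", "Reasoning:", and so on) are hypothetical. Only the ordering comes from the caption: a preamble, then examples that each contain an input layout, an instruction, a chain of thought, and an output layout.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One in-context example (hypothetical field names)."""
    input_layout: str
    instruction: str
    chain_of_thought: str
    output_layout: str


def build_prompt(preamble, examples, input_layout, instruction):
    """Assemble preamble, sequential examples, and the new query into one prompt."""
    parts = [preamble.strip()]
    for ex in examples:
        parts.append(
            f"Input layout:\n{ex.input_layout}\n"
            f"Instruction: {ex.instruction}\n"
            f"Reasoning: {ex.chain_of_thought}\n"
            f"Output layout:\n{ex.output_layout}"
        )
    # The query repeats the example format but stops at "Reasoning:" so the
    # model continues with its own chain of thought and then the edited layout.
    parts.append(
        f"Input layout:\n{input_layout}\n"
        f"Instruction: {instruction}\n"
        f"Reasoning:"
    )
    return "\n\n".join(parts)


examples = [
    Example(
        input_layout="object 0: a red ball | box=[10, 10, 50, 50]",
        instruction="Move the ball in BOX to STAR.",
        chain_of_thought="BOX overlaps object 0, so object 0 moves to STAR.",
        output_layout="object 0: a red ball | box=[200, 40, 50, 50]",
    ),
]
prompt = build_prompt(
    "You edit image layouts according to instructions.",
    examples,
    "object 0: a brown dog | box=[40, 200, 120, 90]",
    "Make the dog in BOX a corgi.",
)
print(prompt)
```

Ending the prompt at the reasoning label is one way to elicit a chain of thought before the output layout. Since the examples sit only in the context, new transformations can be supported by adding or swapping examples without any training.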