FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

Mohammadreza Samadi, Fred X. Han, Mohammad Salameh, Hao Wu, Fengyu Sun, Chunhua Zhou, Di Niu

TL;DR

FunEditor introduces a diffusion-model editing framework that performs complex, localized image edits by aggregating simple, atomic editing functions. It learns trainable task tokens and employs cross-attention masking to apply multiple edits simultaneously to specified regions, enabling efficient four-step inference with no energy-guided optimization. The approach demonstrates superior object movement and pasting results, achieving higher image-quality metrics and substantially lower latency compared with both training-based and training-free baselines on COCOEE and ReS datasets. By leveraging function aggregation, FunEditor provides a data-efficient, scalable path to complex image editing that preserves region fidelity and object appearance during composition. The method is compatible with existing few-step diffusion backbones, offering practical impact for real-time or interactive editing workflows.
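
To make the idea of cross-attention masking concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each conditioning token carries a boolean visibility mask over image positions, and that at least one token (here a hypothetical "null" token in row 0) is visible everywhere so the softmax stays well defined. The function name `masked_cross_attention` and its argument layout are illustrative only.

```python
import numpy as np

def masked_cross_attention(q, k, v, token_masks):
    """Cross-attention in which each token only influences its own region.

    q: (N, d) queries, one per image position.
    k, v: (T, d) keys and values, one per conditioning token.
    token_masks: (T, N) bool, True where token t may influence position i.
    Returns the attended output (N, d) and the attention weights (N, T).
    """
    visible = token_masks.T  # (N, T)
    if not visible.any(axis=-1).all():
        raise ValueError("every position needs at least one visible token")

    scores = q @ k.T / np.sqrt(q.shape[-1])           # (N, T)
    scores = np.where(visible, scores, -np.inf)       # hide tokens outside their mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                          # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With this layout, a task token whose mask excludes a position receives exactly zero attention weight there, which is the property that keeps an edit confined to its region.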

Abstract

Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. Recent studies show that they can apply desired edits effectively by following textual instructions, but two key challenges remain. First, these models struggle to apply multiple edits simultaneously, and their reliance on sequential processing makes them computationally inefficient. Second, relying on textual prompts to determine the editing region can lead to unintended alterations to the image. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. Our experiments demonstrate that FunEditor significantly outperforms recent inference-time optimization methods and fine-tuned models on complex tasks like object movement and object pasting, quantitatively across various metrics, in visual comparisons, or both. Moreover, with only 4 inference steps, FunEditor achieves 5-24x inference speedups over existing popular methods. The code is available at: mhmdsmdi.github.io/funeditor/.

Paper Structure (18 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: (a) Results of applying the two basic functions, Edge Enhancement (center) and Object Removal (right), using their respective masks on the input image. Best viewed when enlarged. (b) Demonstration of function aggregation using the proposed method. By simultaneously applying Object Removal and Edge Enhancement on different masks, complex edits such as object shrinking (middle) and object movement (right) can be achieved. $\mathcal{A}$ represents the operation of function aggregation.
  • Figure 2: Our approach is capable of composing multiple editing functions and applying them simultaneously. This enables it to perform complex edit functions such as object movement, object resizing, and object pasting in 4 steps. $f_{OR}$, $f_{EE}$, and $f_{HR}$ refer to object removal, edge enhancement, and harmonization functions, respectively. Each function is applied only to the specified mask region. To save space, source image I is omitted from the function arguments.
  • Figure 3: Overview of our proposed training and inference pipeline. During the basic task training phase (a), the diffusion model learns to perform various simple tasks based on the provided task tokens and masks. During inference (b), complex edit functions can be implemented by combining multiple task masks and tokens.
  • Figure 4: Harmonization without cross-attention masking affects the entire image (a). With masking, edits are confined to the masked region (b), preventing changes to unmasked areas. The mask is shown in the top-right inset.
  • Figure 5: Qualitative comparison between our approach and baseline methods for object repositioning within an image, demonstrating the superior performance of our method. To move an object, FunEditor composes object removal and edge enhancement functions.
  • ...and 1 more figure
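
The captions above describe object movement as object removal and edge enhancement applied simultaneously on different masks. The sketch below shows one plausible way to derive those two masks from a single object mask and a displacement. It is an illustration under stated assumptions, not the paper's code: the helper names `shift_mask` and `movement_task_masks`, the dictionary keys, and the choice to exclude the overlap from the removal mask are all hypothetical.

```python
import numpy as np

def shift_mask(mask: np.ndarray, dy: int, dx: int) -> np.ndarray:
    """Translate a boolean mask by (dy, dx) pixels; pixels leaving the frame are dropped."""
    h, w = mask.shape
    shifted = np.zeros_like(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    ys, xs = ys + dy, xs + dx
    keep = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    shifted[ys[keep], xs[keep]] = True
    return shifted

def movement_task_masks(object_mask: np.ndarray, dy: int, dx: int) -> dict:
    """Build per-task masks for moving an object by (dy, dx).

    Assumption: edge enhancement acts on the target location, and object
    removal acts on the part of the source location that the moved object
    does not cover, so the two masks never overlap.
    """
    object_mask = object_mask.astype(bool)
    target = shift_mask(object_mask, dy, dx)
    return {
        "object_removal": object_mask & ~target,
        "edge_enhancement": target,
    }
```

Keeping the two masks disjoint means each pixel is assigned to at most one atomic function, which makes the aggregation unambiguous in this toy setting.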