Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Carmine Zaccagnino, Fabio Quattrini, Enis Simsar, Marta Tintoré Gazulla, Rita Cucchiara, Alessio Tonioni, Silvia Cascianelli

TL;DR

This paper tackles multi-instance text-guided image editing within flow-matching editors, where a single global velocity field $v_\theta$ risks interference across concurrent edits. It introduces Instance-Disentangled Attention, which partitions joint attention with token-space sets and two masks $M^{\mathrm{dis}}$ and $M^{\mathrm{har}}$, plus an efficient multi-prompt encoding strategy and optional domain-specific fine-tuning. The authors also present an Infographics Editing Benchmark (Crello Edit and InfoEdit) to stress-test locality and editability in many regions. Empirical results show improved edit disentanglement and locality while preserving global coherence in a single pass, with strong performance on both natural images and text-dense infographics, and favorable human and LLM judgments.

Abstract

Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
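
To make the idea of partitioning joint attention concrete, the following is a minimal NumPy sketch of a block attention mask that binds each instruction's text tokens to its own image region. It is not the paper's construction of $M^{\mathrm{dis}}$ or $M^{\mathrm{har}}$, which this summary does not spell out. The token layout (all prompts concatenated, followed by image tokens), the function names `build_disentangling_mask` and `masked_attention`, and the choice to leave image-to-image attention fully open are all assumptions made for illustration.

```python
import numpy as np

def build_disentangling_mask(text_lens, region_ids):
    """Boolean mask over [text tokens | image tokens]; True = attention allowed.

    text_lens  : number of prompt tokens per instance (hypothetical layout:
                 prompts are concatenated in instance order).
    region_ids : for each image token, the index of the instance whose region
                 contains it, or -1 for background.
    """
    region_ids = np.asarray(region_ids)
    # Owning instance of every text token, e.g. [0, 0, 1, 1, 1] for lens (2, 3).
    text_owner = np.repeat(np.arange(len(text_lens)), text_lens)
    n_text, n_img = len(text_owner), len(region_ids)

    mask = np.zeros((n_text + n_img, n_text + n_img), dtype=bool)
    # Text <-> text: only within the same instruction.
    mask[:n_text, :n_text] = text_owner[:, None] == text_owner[None, :]
    # Text <-> image: only between an instruction and its own region.
    bound = text_owner[:, None] == region_ids[None, :]
    mask[:n_text, n_text:] = bound
    mask[n_text:, :n_text] = bound.T
    # Image <-> image: left fully open in this sketch (an assumption).
    mask[n_text:, n_text:] = True
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean allow-mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Two instructions (2 and 3 tokens) and four image tokens:
# two in region 0, one in region 1, one background token.
mask = build_disentangling_mask([2, 3], [0, 0, 1, -1])
rng = np.random.default_rng(0)
x = rng.normal(size=(mask.shape[0], 8))
out, weights = masked_attention(x, x, x, mask)
```

In this sketch every row keeps at least one allowed entry (text tokens see their own prompt, image tokens see the whole image), so the softmax is always well defined. Image tokens can still exchange information with each other, which is one plausible way to retain global coherence while preventing one instruction from leaking into another's region.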

Paper Structure

This paper contains 19 sections, 11 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Logic visualization of the proposed joint attention masks.
  • Figure 2: CER and AR w.r.t. the number of edits.
  • Figure 3: Inference time comparison w.r.t. the number of edits.
  • Figure 4: Qualitative results on Crello Edit, InfoEdit and LoMOE-Bench. The leftmost column is the source image, the second is the source image with overlaid bounding boxes indicating the areas to edit, and the remaining columns contain qualitative results on that same sample. For readability, we list the prompts for these samples in Appendix \ref{sec:qualitative_prompts}.
  • Figure 5: Detailed representation and analysis of the construction of $M^\mathrm{dis}$ when using FLUX Kontext as a baseline on a sample from LoMOE-Bench. Black indicates areas that are blocked from attending to each other, and grey indicates areas where attention is allowed.
  • ...and 6 more figures