POEM: Precise Object-level Editing via MLLM control

Marco Schouten, Mehmet Onurcan Kaya, Serge Belongie, Dim P. Papadopoulos

TL;DR

POEM tackles the challenge of precise object-level image editing by decoupling visual reasoning from editing and leveraging Multimodal Large Language Models (MLLMs) to produce exact object masks before and after transformations. The framework comprises a five-step process (visual grounding, refinement, edit parsing, transformation, and edit-guided image-to-image translation) guided by MLLM reasoning, enabling high-precision edits within diffusion models without fine-tuning. To validate the approach, the authors introduce VOCEdits, a PASCAL VOC 2012–based dataset with instructional prompts, ground-truth transformations, and masks, and demonstrate that POEM achieves higher edit fidelity than text-based baselines and reduces manual effort relative to interaction-based methods. The work advances controllable image synthesis by integrating MLLMs with diffusion-based editing and provides a rigorous benchmark for object-level edits, highlighting both its strengths and limitations in extreme or non-rigid transformations.

Abstract

Diffusion models have significantly improved text-to-image generation, producing high-quality, realistic images from textual descriptions. Beyond generation, object-level image editing remains a challenging problem, requiring precise modifications while preserving visual coherence. Existing text-based instructional editing methods struggle with localized shape and layout transformations, often introducing unintended global changes. Image interaction-based approaches offer better accuracy but require manual human effort to provide precise guidance. To reduce this manual effort while maintaining high editing accuracy, we propose POEM, a framework for Precise Object-level Editing using Multimodal Large Language Models (MLLMs). POEM leverages MLLMs to analyze instructional prompts and generate precise object masks before and after transformation, enabling fine-grained control without extensive user input. This structured reasoning stage guides the diffusion-based editing process, ensuring accurate object localization and transformation. To evaluate our approach, we introduce VOCEdits, a benchmark dataset based on PASCAL VOC 2012, augmented with instructional edit prompts, ground-truth transformations, and precise object masks. Experimental results show that POEM outperforms existing text-based image editing approaches in precision and reliability while reducing manual effort compared to interaction-based methods.
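
To make the idea of object masks "before and after transformation" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the mask is a small boolean grid and the transformation is a 2x3 affine matrix in pixel coordinates (x to the right, y downward). The names `invert_affine` and `warp_mask` are hypothetical and introduced only for illustration.

```python
from typing import List

Mask = List[List[bool]]
Affine = List[List[float]]  # assumed layout: [[a, b, tx], [c, d, ty]]


def invert_affine(m: Affine) -> Affine:
    """Invert a 2x3 affine matrix (hypothetical helper)."""
    (a, b, tx), (c, d, ty) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is singular")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)]]


def warp_mask(mask: Mask, m: Affine) -> Mask:
    """Compute the post-transformation mask by inverse mapping.

    Each output pixel looks up the nearest source pixel, so the
    result has no holes even when the object is scaled up.
    """
    h, w = len(mask), len(mask[0])
    inv = invert_affine(m)
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = round(inv[0][0] * x + inv[0][1] * y + inv[0][2])
            sy = round(inv[1][0] * x + inv[1][1] * y + inv[1][2])
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = mask[sy][sx]
    return out


# A 2x2 object in an 8x8 image, moved 3 pixels to the left.
before = [[False] * 8 for _ in range(8)]
for row in (2, 3):
    for col in (4, 5):
        before[row][col] = True

shift_left_3 = [[1.0, 0.0, -3.0], [0.0, 1.0, 0.0]]
after = warp_mask(before, shift_left_3)
```

Inverse mapping is chosen here because forward-mapping source pixels can leave gaps in the target mask under scaling. A real pipeline would more likely use an image library's affine warp on full-resolution masks.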

Paper Structure

This paper contains 11 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: POEM. Existing text-based instruction editing methods (top) struggle with precise object-level shape and layout edits. Image interaction-based approaches (middle) perform better but require significant manual user effort. Instead, we propose (bottom) leveraging MLLMs to interpret instructional prompts and automatically generate precise object masks and numerical transformations to support image editing pipelines.
  • Figure 2: Overview of our approach. An image and a user edit prompt are fed into the reasoning stage, where we analyze the scene and extract object-level masks and precise transformation parameters for appearance and shape edits. During the editing stage, we apply these edits during inference without any additional training or fine-tuning.
  • Figure 3: Detailed pipeline of POEM. Given an image and an edit prompt, we first use an MLLM to analyze the scene and identify objects. Then, we refine the detections and enhance object masks using Grounded SAM. Next, we use a text-based LLM to predict the transformation matrix of the initial segmentation mask. Finally, we perform an image-to-image translation guided by the previous steps to generate the edited image. This structured pipeline enables precise object-level editing with high visual fidelity while preserving spatial and visual coherence.
  • Figure 4: VOCEdits evaluation subset statistics. Distributions of (a) object classes, (b) transformation types, and (c) transformation difficulty levels.
  • Figure 5: Qualitative results. We compare POEM with state-of-the-art image editing models across a diverse set of edit instructions, including geometric transformations (e.g., translation, scaling), appearance changes, and combinations of both. The specific prompts used are "Scale the bus by 0.56", "Move the pear left by 150px and make it red", "Scale the mug only vertically to 200px", "Make the sword gold", "Scale the orange by 2 and move it left by 150px", and "Move the ball left by 90px and make it blue".
  • ...and 1 more figure
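
The numerical prompts quoted for Figure 5 (for example "Scale the orange by 2 and move it left by 150px") can be turned into a single transformation matrix. The sketch below illustrates one way to do that under stated assumptions: scaling is taken about the object's center, "left" means a negative x offset in pixel coordinates, and the matrix uses a 2x3 affine layout. The source does not specify these conventions, and the function names `scale_then_translate` and `apply_affine` are hypothetical.

```python
from typing import List, Tuple

Affine = List[List[float]]  # assumed layout: [[a, b, tx], [c, d, ty]]


def scale_then_translate(sx: float, sy: float,
                         dx: float, dy: float,
                         center: Tuple[float, float]) -> Affine:
    """Scale about `center`, then shift by (dx, dy) pixels.

    Derivation for x: x' = sx * (x - cx) + cx + dx
                         = sx * x + cx * (1 - sx) + dx
    and likewise for y.
    """
    cx, cy = center
    return [[sx, 0.0, cx * (1.0 - sx) + dx],
            [0.0, sy, cy * (1.0 - sy) + dy]]


def apply_affine(m: Affine, x: float, y: float) -> Tuple[float, float]:
    """Map a single point through the 2x3 affine matrix."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# "Scale the orange by 2 and move it left by 150px",
# with an assumed object center at (300, 200).
center = (300.0, 200.0)
matrix = scale_then_translate(2.0, 2.0, -150.0, 0.0, center)

new_center = apply_affine(matrix, *center)      # center shifts left by 150
edge_point = apply_affine(matrix, 310.0, 200.0)  # 10 px offset doubles to 20
```

A vertical-only edit such as "Scale the mug only vertically to 200px" would fit the same helper with `sx = 1.0` and `sy` set to the ratio of the target height to the current mask height, again as an assumption about how the prompt is interpreted.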