Action-based image editing guided by human instructions

Maria Mihaela Trusca, Mingxiao Li, Marie-Francine Moens

TL;DR

This work addresses the gap in dynamic, action-based image editing by introducing EditAction, a diffusion-based model built on InstructPix2Pix that can modify object positions or postures according to textual action instructions while preserving object appearance and the scene background. It frames action editing as a supervised task using before/after image pairs and learns through a contrastive objective that distinguishes the correct action from distractors, optimized with the combined loss $L = L_{static} + \lambda_1 L_{action} + \lambda_2 L_{reg}$ and inference-time conditioning via $\hat{\epsilon}_\theta = \epsilon_\theta(\cdot) + s_i(\cdot) + s_c(\cdot)$. The authors create two datasets, LC (MPII-Cooking, fixed camera) and HC (EPIC-Kitchens, moving camera), and demonstrate that EditAction outperforms baselines in implementing actions while preserving background and target objects, including generalization to unseen verbs. They show the model can reason to extend scenes beyond the input frame, with quantitative gains in action accuracy (via TimeSformer) and competitive FID scores, validated by qualitative human judgments. The work highlights the practical potential of action-guided editing for dynamic visual tasks, while noting training-time costs and limitations in long-distance or highly ambiguous actions, suggesting future work to scale action repertoire and efficiency.
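The combined loss and the guidance rule in the summary above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default weights, and the list-of-floats representation are hypothetical, and the guidance expansion assumes the standard two-scale classifier-free guidance of InstructPix2Pix, which the summary abbreviates as $s_i(\cdot)$ and $s_c(\cdot)$.

```python
def combined_loss(l_static, l_action, l_reg, lambda_1=1.0, lambda_2=1.0):
    """L = L_static + lambda_1 * L_action + lambda_2 * L_reg.

    The default weights are placeholders, not values from the paper.
    """
    return l_static + lambda_1 * l_action + lambda_2 * l_reg


def guided_noise(eps_uncond, eps_image, eps_full, s_i, s_c):
    """Two-scale classifier-free guidance (InstructPix2Pix style, assumed here).

    eps_uncond: noise prediction with neither image nor text conditioning
    eps_image:  noise prediction conditioned on the input image only
    eps_full:   noise prediction conditioned on the image and the instruction
    s_i, s_c:   image and text guidance scales
    """
    return [
        u + s_i * (i - u) + s_c * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]
```

With both scales set to 1 the expression reduces to the fully conditioned prediction; larger values of `s_c` weight the text instruction more heavily relative to the input image.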

Abstract

Text-based image editing is typically approached as a static task that involves operations such as inserting, deleting, or modifying elements of an input image based on human instructions. Given the static nature of this task, in this paper, we aim to make this task dynamic by incorporating actions. By doing this, we intend to modify the positions or postures of objects in the image to depict different actions while maintaining the visual properties of the objects. To implement this challenging task, we propose a new model that is sensitive to action text instructions by learning to recognize contrastive action discrepancies. The model training is done on new datasets defined by extracting frames from videos that show the visual scenes before and after an action. We show substantial improvements in image editing using action-based text instructions and high reasoning capabilities that allow our model to use the input image as a starting scene for an action while generating a new image that shows the final scene of the action.
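To make the before/after frame pairing concrete, the sketch below turns annotated action segments into (input frame, edited frame, instruction) triples. The `ActionSegment` fields, the frame-index convention, and the `margin` parameter are all hypothetical; this summary does not describe the actual extraction procedure used for the LC and HC datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ActionSegment:
    """Hypothetical annotation: an instruction plus the frames it spans."""
    instruction: str
    start_frame: int
    end_frame: int


def build_pairs(
    segments: List[ActionSegment], num_frames: int, margin: int = 0
) -> List[Tuple[int, int, str]]:
    """Return (before_index, after_index, instruction) for each usable segment."""
    pairs = []
    for seg in segments:
        # Step back/forward by `margin` frames, clamped to the video bounds.
        before = max(seg.start_frame - margin, 0)
        after = min(seg.end_frame + margin, num_frames - 1)
        # Skip degenerate segments where both indices name the same frame.
        if before < after:
            pairs.append((before, after, seg.instruction))
    return pairs
```

Each triple would then supply the input image, the target image, and the action-based text instruction for one supervised training example.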

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Comparison between EditAction and the baselines on the LC and HC datasets. Unlike the baselines, our model can implement actions while preserving the background and the appearance of the objects involved in the action. More examples are presented in the Appendix.
  • Figure 2: Left side: HC dataset; Right side: LC dataset. Red indicates the object targeted by the action. Green and blue show the starting and ending points of the action.
  • Figure 3: Training of the U-Net model employed by EditAction. Given an input and an edited image, the model is trained to enhance alignment with the action-based text instruction while preventing hallucinations that may result from fine-tuning on small-scale datasets.
  • Figure 4: Illustrations of EditAction's reasoning capabilities, which allow it to extend the scene of the input image based on the action-based text command.
  • Figure 5: Despite being trained on a limited number of actions, EditAction can implement actions using verbs unseen during training, such as "wipe" for the LC dataset and "place" for the HC dataset.
  • ...and 3 more figures