Action-based image editing guided by human instructions
Maria Mihaela Trusca, Mingxiao Li, Marie-Francine Moens
TL;DR
This work addresses the lack of dynamic, action-based image editing by introducing EditAction, a diffusion model built on InstructPix2Pix that changes the positions or postures of objects according to textual action instructions while preserving object appearance and the scene background. It frames action editing as a supervised task on before/after image pairs and trains with a contrastive objective that distinguishes the correct action from distractors. Training optimizes the combined loss $L = L_{static} + \lambda_1 L_{action} + \lambda_2 L_{reg}$, and inference uses guided conditioning of the form $\hat{\epsilon}_\theta = \epsilon_\theta(\cdot) + s_i(\cdot) + s_c(\cdot)$. The authors create two datasets, LC (MPII-Cooking, fixed camera) and HC (EPIC-Kitchens, moving camera), and report that EditAction outperforms baselines at implementing actions while preserving the background and target objects, including on unseen verbs. The model can also treat the input image as the starting scene of an action and extend the scene beyond the input frame. Reported results include gains in action accuracy (measured with TimeSformer) and competitive FID scores, supported by qualitative human judgments. The work highlights the practical potential of action-guided editing for dynamic visual tasks, notes training-time cost and limitations on long-distance or highly ambiguous actions, and suggests future work on scaling the action repertoire and improving efficiency.
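The abbreviated inference formula above has the same shape as the two-scale classifier-free guidance used by InstructPix2Pix, on which EditAction is built. The following is a minimal sketch of that combination rule, not the authors' code. It assumes that $s_i$ scales the image-conditioning term and $s_c$ the text-instruction term, as in InstructPix2Pix. The function name `guided_noise`, the flat-list representation of noise predictions, and the default scale values (common InstructPix2Pix settings) are illustrative assumptions.

```python
def guided_noise(eps_uncond, eps_img, eps_full, s_i=1.5, s_c=7.5):
    """Combine three noise predictions with two guidance scales.

    eps_uncond: prediction with neither image nor text conditioning
    eps_img:    prediction conditioned on the input image only
    eps_full:   prediction conditioned on the image and the instruction
    s_i, s_c:   image and text guidance scales (assumed defaults)
    Each prediction is a flat list of floats standing in for a tensor.
    """
    return [
        u + s_i * (i - u) + s_c * (f - i)
        for u, i, f in zip(eps_uncond, eps_img, eps_full)
    ]


# Toy values only; real predictions come from the denoising network.
eps_u = [0.0, 1.0]
eps_i = [0.5, 1.5]
eps_f = [1.0, 3.0]
print(guided_noise(eps_u, eps_i, eps_f, s_i=1.0, s_c=1.0))
```

With both scales set to 1 the terms telescope and the result equals the fully conditioned prediction. Larger values of `s_c` push the sample further toward the instruction, and larger values of `s_i` push it toward the input image.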
Abstract
Text-based image editing is typically approached as a static task that involves operations such as inserting, deleting, or modifying elements of an input image based on human instructions. In this paper, we aim to make this task dynamic by incorporating actions: we modify the positions or postures of objects in the image to depict different actions while maintaining the visual properties of the objects. To implement this challenging task, we propose a new model that is sensitive to action text instructions by learning to recognize contrastive action discrepancies. The model is trained on new datasets built by extracting video frames that show the visual scene before and after an action. We show substantial improvements in image editing with action-based text instructions, as well as strong reasoning capabilities that allow our model to use the input image as the starting scene of an action and generate a new image showing the final scene of the action.
