Instruction-based Image Manipulation by Watching How Things Move
Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia
TL;DR
This work addresses the difficulty of training models to perform complex, instruction-based edits on real photos. It introduces InstructMove, a model trained on a dataset of video frame pairs whose editing instructions are generated by multimodal LLMs, enabling non-rigid transformations and viewpoint changes while preserving identity. A key contribution is a spatial conditioning strategy that concatenates the reference image with the noisy target latent along the width, so that the model's existing attention layers can draw on the reference content without architectural changes. On a new non-rigid editing benchmark, the model achieves state-of-the-art results for pose, rearrangement, and viewpoint edits, supported by human evaluation, and it remains compatible with additional controls such as masks and ControlNet. The approach offers a scalable route to more realistic, controllable image editing supervised by natural video data and multimodal reasoning.
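The width-wise concatenation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list latent layout, the choice to place the target on the left, and the step of cropping the prediction back to the target half are all assumptions made for illustration. The point it shows is that the joined latent keeps the same channel count and height, so a denoiser can consume it without new input layers.

```python
from typing import List

# Hypothetical latent layout for illustration: [channels][height][width].
Latent = List[List[List[float]]]


def concat_along_width(noisy_target: Latent, reference: Latent) -> Latent:
    """Join two latents side by side along the width axis.

    Assumption: the noisy target goes on the left and the reference on the
    right; the source text does not state the order.
    """
    if len(noisy_target) != len(reference):
        raise ValueError("channel counts differ")
    joined = []
    for target_channel, reference_channel in zip(noisy_target, reference):
        if len(target_channel) != len(reference_channel):
            raise ValueError("heights differ")
        # Each row of the result is the target row followed by the reference row.
        joined.append([t_row + r_row
                       for t_row, r_row in zip(target_channel, reference_channel)])
    return joined


def crop_target_half(prediction: Latent, target_width: int) -> Latent:
    """Keep only the columns that belong to the target (assumed post-processing)."""
    return [[row[:target_width] for row in channel] for channel in prediction]


# Toy 1-channel, 2x2 latents.
noisy = [[[1.0, 2.0], [3.0, 4.0]]]
ref = [[[9.0, 8.0], [7.0, 6.0]]]

joined = concat_along_width(noisy, ref)
target_only = crop_target_half(joined, target_width=2)
```

Because the result is a single, wider latent rather than a latent with extra channels, any attention over spatial positions in the denoiser spans both halves, which is how the reference content can inform the target.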
Abstract
This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
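The frame-pair sampling step of the pipeline can be sketched as follows. This is an illustrative sketch only: the abstract does not specify how pairs are chosen, so the function name `sample_frame_pairs` and the `min_gap` and `max_gap` parameters (which bound the temporal distance between the two frames) are hypothetical. The MLLM step that writes an editing instruction for each pair is not shown.

```python
import random
from typing import List, Tuple


def sample_frame_pairs(num_frames: int,
                       num_pairs: int,
                       min_gap: int,
                       max_gap: int,
                       seed: int = 0) -> List[Tuple[int, int]]:
    """Return (source_index, target_index) pairs drawn from one video.

    min_gap and max_gap are hypothetical knobs: a gap that is too small
    gives near-identical frames, while a very large gap risks a scene change.
    """
    if not 1 <= min_gap <= max_gap:
        raise ValueError("require 1 <= min_gap <= max_gap")
    if num_frames <= min_gap:
        return []  # video too short to yield any valid pair
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Clamp the gap so the target index stays inside the video.
        gap = rng.randint(min_gap, min(max_gap, num_frames - 1))
        source = rng.randrange(0, num_frames - gap)
        pairs.append((source, source + gap))
    return pairs


pairs = sample_frame_pairs(num_frames=100, num_pairs=5, min_gap=10, max_gap=30)
```

Each returned pair would then serve as the source and target images of one training example, with the earlier frame as the image to edit and the later frame as the desired result.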
