Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia

TL;DR

This work addresses the difficulty of teaching models to perform complex, instruction-based edits on real photos. It introduces InstructMove, a model trained on video frame pairs whose editing instructions are generated by multimodal LLMs, which enables non-rigid transformations and viewpoint changes while preserving identity. A key contribution is a spatial conditioning strategy that concatenates the reference image latent with the noisy target latent along the width dimension, so the denoiser's self-attention can carry content over from the reference without architectural changes. Experiments on a new non-rigid editing benchmark show state-of-the-art performance on pose, rearrangement, and viewpoint edits, supported by human judgments, and the model is compatible with additional controls such as masks and ControlNet. The approach offers a scalable path to more realistic, controllable image editing driven by supervision from natural video and multimodal reasoning.

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
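
The frame-pair sampling step mentioned in the abstract can be illustrated with a minimal sketch. The summary here does not state which motion measure or thresholds the authors use, so the mean absolute pixel difference, the function name `sample_frame_pairs`, and the parameters `gap`, `low`, and `high` are all hypothetical stand-ins chosen only to show the idea of keeping pairs with moderate change. The MLLM instruction-generation step is omitted because it depends on an external model.

```python
import numpy as np


def sample_frame_pairs(frames, gap, low, high):
    """Return (i, j, score) for frames `gap` apart whose change is moderate.

    `score` is the mean absolute pixel difference. This is an assumed,
    illustrative motion measure, not the criterion used in the paper.
    """
    pairs = []
    for i in range(len(frames) - gap):
        j = i + gap
        # Cast to float so that uint8 subtraction cannot wrap around.
        a = frames[i].astype(np.float32)
        b = frames[j].astype(np.float32)
        score = float(np.mean(np.abs(a - b)))
        # Reject near-identical pairs (too small) and scene cuts (too large).
        if low <= score <= high:
            pairs.append((i, j, score))
    return pairs


def make_toy_video(num_frames=6, size=16, step=2):
    """Toy grayscale video: a bright 4x4 square drifting to the right."""
    frames = []
    for t in range(num_frames):
        frame = np.zeros((size, size), dtype=np.uint8)
        x = t * step
        frame[4:8, x:x + 4] = 255
        frames.append(frame)
    return frames


frames = make_toy_video()
pairs = sample_frame_pairs(frames, gap=2, low=1.0, high=50.0)
```

In a real pipeline each retained pair would then be passed to an MLLM together with a prompt asking for an editing instruction that turns the first frame into the second.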

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: We propose InstructMove, an instruction-based image editing model trained on frame pairs from videos with instructions generated by Multimodal LLMs. Our model excels at non-rigid editing, such as adjusting subject poses, expressions, and altering viewpoints, while maintaining content consistency. Additionally, our method supports precise, localized edits through the integration of masks, human poses, and other control mechanisms.
  • Figure 2: Existing methods struggle with complex edits on real photos, such as non-rigid transformations. They often either fail to follow the editing instructions or produce inconsistent outputs.
  • Figure 3: Our data construction pipeline. (a) We begin by sampling suitable frame pairs from videos, ensuring realistic and moderate transformations. (b) These frame pairs are used to prompt Multimodal Large Language Models (MLLMs) to generate detailed editing instructions. (c) This process results in a large-scale dataset with realistic image pairs and precise editing instructions.
  • Figure 4: Overview of the proposed model architecture for instruction-based image editing. The source and target images are first encoded into latent representations $z^s$ and $z^e$ using a pretrained encoder. The target latent $z^e$ is then transformed into a noisy latent $z^e_t$ through the forward diffusion process. We concatenate the source image latent and the noisy target latent along the width dimension to form the model input, which is fed into the denoising U-Net $\epsilon_\theta$ to predict a noise map. The right half of the output, corresponding to the noisy target input, is cropped and compared with the original noise map.
  • Figure 5: Qualitative comparison with state-of-the-art image editing methods, including both description-based and instruction-based approaches. Existing methods struggle with complex edits such as non-rigid transformations (e.g., changes in pose and expression), object repositioning, or viewpoint adjustments. They often either fail to follow the editing instructions or produce images with inconsistencies, such as identity shifts. In contrast, our method, trained on real video frames with naturalistic transformations, successfully handles these edits while maintaining consistency with the original input images.
  • ...and 3 more figures
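
The training step described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: `denoiser` is a hypothetical callable standing in for the U-Net $\epsilon_\theta$, the forward process is assumed to be the standard DDPM form $z^e_t = \sqrt{\bar\alpha_t}\, z^e + \sqrt{1-\bar\alpha_t}\, \epsilon$, the loss is assumed to be a mean squared error, and text and timestep conditioning are left out for brevity.

```python
import numpy as np


def forward_diffuse(z0, noise, alpha_bar_t):
    """Standard DDPM forward step (assumed): sqrt(a) * z0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise


def spatial_conditioning_loss(denoiser, z_s, z_e, alpha_bar_t, rng):
    """MSE between the sampled noise and the right half of the denoiser output.

    z_s, z_e : source and target latents, each of shape (C, H, W).
    denoiser : hypothetical stand-in for the U-Net, maps (C, H, 2W) -> (C, H, 2W).
    """
    noise = rng.standard_normal(z_e.shape)
    z_e_t = forward_diffuse(z_e, noise, alpha_bar_t)
    # Concatenate along the width axis: clean source on the left, noisy target on the right.
    model_in = np.concatenate([z_s, z_e_t], axis=-1)
    pred = denoiser(model_in)
    # Keep only the right half, which corresponds to the noisy target input.
    width = z_e.shape[-1]
    pred_target = pred[..., width:]
    return float(np.mean((pred_target - noise) ** 2))


rng = np.random.default_rng(0)
z_s = rng.standard_normal((4, 8, 8))
z_e = rng.standard_normal((4, 8, 8))

# A trivial denoiser that always predicts zero noise, used only to exercise the code.
zero_denoiser = lambda x: np.zeros_like(x)
loss = spatial_conditioning_loss(zero_denoiser, z_s, z_e, alpha_bar_t=0.5, rng=rng)
```

Because the source latent sits next to the target in the same spatial grid, the network input keeps its usual channel count, which is consistent with the caption's description of a standard denoising U-Net being used.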