Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa

TL;DR

This work tackles complex, non-rigid edits that insert human subjects into new scenes while preserving identity, by fine-tuning an inpainting diffusion model conditioned on a reference image, 2D pose, and scene-difference captions. It uses multimodal language models to derive noisy yet informative captions from video frames, and combines this weak supervision with pose cues to improve person-object interactions in the wild. The approach outperforms image-only baselines and re-implemented prior methods in controllability and interaction realism, with trade-offs in photorealism and identity preservation in challenging scenes. The results advance intuitive, user-centric editing of human subjects in complex environments, while highlighting ethical considerations and the need for safeguards in deployment.

Abstract

In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images: the first is a reference image with the person, and the second is a "target image" showing the same person (with a different pose and possibly in a different background). Additionally, we require a text caption describing the new pose relative to that in the reference image. We present a novel dataset following these criteria, which we create using pairs of frames from human-centric and action-rich videos and by employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially in scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions with robust 2D pose improves the quality of person-object interactions.

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Given a single image, multiple controllable identity-preserving edits can be specified with different text captions. Given a masked insertion scene and a reference image containing a person to insert, our fine-tuned model inserts the person into the scene as directed by a given text caption. The examples show text- and image-based inference on unseen images using an image + text model. Where captions are unavailable, we train the text and image model on a blank caption.
  • Figure 2: Complex edits are achievable through weakly annotated supervision of a portion of the overall dataset, combined with joint fine-tuning on text. Given a masked scene to insert a person into ("Scene") and a segmented crop of the person ("Ref"), the person is inserted into the scene. For comparison, we also provide the ground truth image that is paired with the caption and input. Given a scene and a person to insert, our model is capable of multiple non-rigid edits that preserve the identity of the person, despite a relatively small set of 13,487 weakly annotated captioned image pairs out of 78,000 videos. See Appendix Table 1 for dataset details and Section \ref{sec:three_three} for details on our weakly annotated captions.
  • Figure 3: Comparison of our approach to baselines for identity preservation and controllability of in-the-wild images. The input image is controlled using the prompts below each row, and the human subject is transferred to new frames for relevant models (ours and Kulal et al.). No baseline achieves comparable identity-preserving non-rigid edits on in-the-wild data. Baselines either insert a person without controllability (Kulal et al.), or are controllable but fail to generalize to in-the-wild images (MasaCtrl, PIDM). Our approach maintains photorealism similar to that of Kulal et al., with improved controllability. * PIDM works well on its training dataset but is brittle in the wild, likely due to its fashion-related training dataset. † Kulal et al. results are from the re-trained model with image conditioning only.
  • Figure 4: System diagram illustrating the process of generating a desired edit using multiple inputs, including noise target latent, binary mask, masked target latent, reference image, and change-of-scene prompt. The Affordance Diffusion Network on the right is the formulation proposed by Kulal et al. (kulal2023putting); our improvements to controllability come from the framework described on the left. We study combinations of pose and weakly annotated text conditioning to learn more controllable and complex image edits that still preserve the identity of a person in a scene. To our knowledge, no other work combines controllable non-rigid edits with identity preservation while also working in the wild.
  • Figure 5: Pose conditioning alone is insufficient to specify person-object interactions. Adding textual embeddings allows the model to contextualize how the person interacts with their surroundings. For example, the exercise ball and the bicycle are faithfully maintained with text in Rows 3 and 4, but not with pose alone. We show qualitative improvement of person-object interactions by conditioning on weakly generated captions, combined with reference images and pose data. For conditioning, we use the pose from the ground truth combined with the pose from the reference image, fed through a single learnable dense layer.
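
The pose conditioning described in the Figure 5 caption (ground-truth pose combined with reference pose, passed through a single learnable dense layer) can be illustrated with a minimal sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: that "combined" means concatenation, that each pose is 17 two-dimensional keypoints, that the output embedding has 8 dimensions, and the names `flatten_pose`, `DenseLayer`, and `pose_conditioning`, which are hypothetical.

```python
import random

NUM_KEYPOINTS = 17  # assumption: COCO-style 2D keypoints; the source gives no count
EMBED_DIM = 8       # assumption: illustrative embedding size only


def flatten_pose(keypoints):
    """Flatten [(x, y), ...] into [x0, y0, x1, y1, ...]."""
    return [coord for point in keypoints for coord in point]


class DenseLayer:
    """One affine layer y = W x + b; W and b would be learned during fine-tuning."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / in_dim ** 0.5
        # Small random initial weights, zero initial bias.
        self.weights = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        if len(x) != len(self.weights[0]):
            raise ValueError("input size does not match the layer's in_dim")
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def pose_conditioning(ref_pose, target_pose, layer):
    """Concatenate reference and target poses (an assumed reading of
    'combined'), then project them with a single dense layer."""
    joint = flatten_pose(ref_pose) + flatten_pose(target_pose)
    return layer(joint)


# Toy poses with normalized image coordinates (placeholder values, not real data).
ref_pose = [(0.5, 0.1 + 0.05 * i) for i in range(NUM_KEYPOINTS)]
target_pose = [(0.6, 0.1 + 0.05 * i) for i in range(NUM_KEYPOINTS)]

# Each pose contributes 2 * NUM_KEYPOINTS values, so the joint input has 4 * NUM_KEYPOINTS.
layer = DenseLayer(in_dim=4 * NUM_KEYPOINTS, out_dim=EMBED_DIM)
embedding = pose_conditioning(ref_pose, target_pose, layer)
print(len(embedding))  # prints 8
```

In a real model the resulting vector would be supplied to the diffusion network alongside the text and reference-image conditioning; how the authors wire it in is not specified in this section, so that step is left out of the sketch.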