InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

Nirat Saini, Navaneeth Bodla, Ashish Shrivastava, Avinash Ravichandran, Xiao Zhang, Abhinav Shrivastava, Bharat Singh

TL;DR

InVi tackles the problem of inserting or replacing objects in videos without training on video data by leveraging off-the-shelf latent diffusion models. It introduces a two-stage inpainting and matching pipeline, using an anchor frame and extended-attention to enforce temporal coherence across frames, while avoiding per-video fine-tuning. The method demonstrates superior background fidelity and temporal consistency against baselines on diverse datasets, supported by quantitative metrics and a user study. The approach enables practical, long-form video edits with flexible conditioning, though it relies on 2D bounding boxes and could benefit from automated layout generation in the future.

Abstract

We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. InVi targets controlled manipulation of objects and blending them seamlessly into a background video, unlike existing video editing methods that focus on comprehensive re-styling or entire scene alterations. To achieve this goal, we tackle two key challenges. Firstly, for high-quality control and blending, we employ a two-step process involving inpainting and matching. This process begins with inserting the object into a single frame using a ControlNet-based inpainting diffusion model, and then generating subsequent frames conditioned on features from the inpainted frame, which serves as an anchor to minimize the domain gap between the background and the object. Secondly, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers. The anchor frame features serve as the keys and values for these layers, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, presenting an efficient and adaptable solution. Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
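
The extended-attention idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, no batching, and that the current frame keeps its own keys and values with the anchor frame's tokens concatenated to them. The function name `extended_attention` and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical names chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(cur, anchor, w_q, w_k, w_v):
    """Single-head attention whose keys/values also include anchor tokens.

    cur:    (n_cur, d) tokens of the frame currently being denoised
    anchor: (n_anchor, d) tokens of the already inpainted anchor frame
    w_q, w_k, w_v: (d, d_head) projection matrices (hypothetical names)
    """
    # Queries come only from the current frame.
    q = cur @ w_q
    # Keys and values are computed over current + anchor tokens (assumption:
    # plain concatenation along the token axis).
    kv_tokens = np.concatenate([cur, anchor], axis=0)
    k = kv_tokens @ w_k
    v = kv_tokens @ w_v
    # Scaled dot-product attention over the extended token set.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Because every query of the current frame can attend to anchor tokens, the output for each position is a mixture that includes the anchor's appearance features, which is the mechanism the abstract credits for cross-frame consistency. Ordinary self-attention is recovered by passing an anchor with zero tokens.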

Paper Structure (16 sections, 2 equations, 9 figures, 1 table, 1 algorithm)

This paper contains 16 sections, 2 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: InVi inserts objects into a background video using a foreground mask, a control signal (e.g., pose, Canny edges, or a depth map), and a text prompt by leveraging off-the-shelf diffusion models. It ensures that the inserted object aligns semantically with the text, is temporally coherent, and conforms spatially to the control signal.
  • Figure 2: InVi inference pipeline: (a) Given a video and object bounding boxes, we first crop a region around the bounding box; this crop is the region to be inpainted. (b) Next, we use a ControlNet-based inpainting diffusion model to inpaint the cropped region in the first frame. (c) To ensure temporal consistency when inpainting subsequent frames, we employ the previous frame as an anchor image. This is achieved by adapting the self-attention block of the denoising U-Net with extended attention. Specifically, we augment the keys and values of the current frame being inpainted with those of the anchor frame, allowing for consistent appearance. Finally, the inpainted crop is seamlessly integrated back into the input video.
  • Figure 3: Post-processing to remove flickering square artifacts. (a) Background image. (b) Initial image generated from our pipeline. (c) Zoomed-in view revealing artifacts around the inserted object. (d) A trimap is generated to facilitate seamless blending of the object into the background. (e) Post-processed frame showcasing the final result after blending the inserted object with the background.
  • Figure 4: User preference study. InVi outperforms baseline methods in text alignment, background and temporal appearance consistency, and overall video quality.
  • Figure 5: Qualitative results. The first image is a background frame from the video undergoing inpainting. Subsequent frames depict the video with the inserted object.
  • ...and 4 more figures
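
The crop-and-reintegrate steps in the Figure 2 caption, together with the blending mentioned in the Figure 3 caption, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `expand_box` and `paste_crop`, the `margin` parameter, and the use of a per-pixel `alpha` map (for example, one derived from a trimap) are all hypothetical, and the diffusion inpainting step itself is left out.

```python
import numpy as np


def expand_box(box, margin, height, width):
    """Grow an (x0, y0, x1, y1) box by `margin` pixels, clipped to the frame."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(width, x1 + margin),
        min(height, y1 + margin),
    )


def paste_crop(frame, crop, region, alpha):
    """Blend an inpainted crop back into the frame.

    frame:  (H, W, 3) background frame
    crop:   (h, w, 3) inpainted crop matching `region`
    region: (x0, y0, x1, y1) as returned by expand_box
    alpha:  (h, w) weights in [0, 1]; 1 keeps the crop, 0 keeps the background
    """
    x0, y0, x1, y1 = region
    out = frame.astype(np.float32)  # astype returns a new array
    a = alpha[..., None]  # broadcast the weights over the colour channels
    out[y0:y1, x0:x1] = a * crop + (1.0 - a) * out[y0:y1, x0:x1]
    return out
```

In use, the region returned by `expand_box` would be cut from each frame, passed through the inpainting model, and handed to `paste_crop`. A soft `alpha` that falls off toward the crop border is one plausible way to avoid the square seams that Figure 3 describes.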