
Training-Free Semantic Video Composition via Pre-trained Diffusion Model

Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song

TL;DR

This work tackles reference-guided video composition under semantic disparities that go beyond simple color and lighting adjustments, introducing a training-free pipeline built on a pre-trained diffusion model (Stable Diffusion). Videos are processed frame by frame in a cascade, and each frame passes through two stages: inversion, then generation. Balanced Partial Inversion initializes generation with a latent $z^c_{i,t_b}$ taken at step $t_b$ (where $t_b \in (0,T)$ and $T=20$), Inter-Frame Augmented attention (IFA) propagates foreground continuity from the processed previous frame $I^h_{i-1}$, and a background replacement preserves the reference scene. The method improves harmony and inter-frame coherence across both shallow and deep semantic disparities, outperforming baselines in balancing temporal consistency and semantic alignment. Because the inversion depth is controllable, the approach preserves the characteristics of the input foreground. It is of practical use for video editing and creative synthesis, with potential extensions to multi-object composition and broader semantic domains.
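To make the role of $t_b$ concrete, the following is a minimal sketch of partial inversion followed by generation. It uses plain deterministic DDIM updates truncated at step $t_b$, not the authors' Balanced Partial Inversion, whose balancing rule is not given in this summary. The linear schedule in `make_alphas`, the noise predictor `eps_fn`, and all function names are assumptions for illustration; in practice `eps_fn` would be the Stable Diffusion noise predictor.

```python
import math

def make_alphas(T=20, a_min=0.05):
    """Hypothetical cumulative-alpha schedule: 1.0 at step 0 (clean), a_min at step T."""
    return [1.0 - (1.0 - a_min) * t / T for t in range(T + 1)]

def ddim_move(z, eps, a_from, a_to):
    """Deterministic DDIM update that moves latent z from noise level a_from to a_to."""
    return [
        math.sqrt(a_to) * (zi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        + math.sqrt(1.0 - a_to) * ei
        for zi, ei in zip(z, eps)
    ]

def partial_invert(z0, eps_fn, alphas, t_b):
    """Run only t_b of the T inversion steps and return the initial point."""
    z = list(z0)
    for t in range(t_b):
        z = ddim_move(z, eps_fn(z, t), alphas[t], alphas[t + 1])
    return z

def generate(z_tb, eps_fn, alphas, t_b):
    """Run the generation process from step t_b back down to step 0."""
    z = list(z_tb)
    for t in range(t_b, 0, -1):
        z = ddim_move(z, eps_fn(z, t), alphas[t], alphas[t - 1])
    return z

if __name__ == "__main__":
    alphas = make_alphas(T=20)
    toy_eps = lambda z, t: [0.3, -0.2, 0.1]  # stand-in for a real noise predictor
    z0 = [1.0, -0.5, 0.25]
    z_tb = partial_invert(z0, toy_eps, alphas, t_b=15)
    print(generate(z_tb, toy_eps, alphas, t_b=15))
```

With a real noise predictor the round trip is only approximate. A smaller `t_b` keeps the result closer to the input, while a larger `t_b` leaves more room for modification, which is the trade-off the choice of $t_b$ has to balance.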

Abstract

The video composition task aims to integrate specified foregrounds and backgrounds from different videos into a harmonious composite. Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle with deep semantic disparities, such as domain gaps, that go beyond superficial adjustments. Therefore, we propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge, which can process composite videos with broader semantic disparities. Specifically, we process the video frames in a cascading manner and handle each frame in two processes with the diffusion model. In the inversion process, we propose Balanced Partial Inversion to obtain initial points for generation that balance reversibility and modifiability. Then, in the generation process, we further propose Inter-Frame Augmented attention to augment foreground continuity across frames. Experimental results show that our pipeline ensures the visual harmony and inter-frame coherence of the outputs, demonstrating its efficacy in managing broader semantic disparities.
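The cascading, two-process flow described in the abstract can be outlined as a short loop. This is a structural sketch only: `invert` and `generate` are hypothetical callables standing in for the diffusion model's two processes, frames are flat lists of values, and the per-frame foreground mask is assumed to be given. Applying the background replacement once, on the output of each frame, is also an assumption; the summary does not say where in the pipeline it happens.

```python
def compose_video(composite_frames, fg_masks, invert, generate, t_b=15):
    """Process composite frames in a cascade; each output conditions the next frame.

    invert(frame, t_b)          -> initial point for generation (hypothetical callable)
    generate(z_tb, t_b, prev)   -> harmonized frame; prev is None for the first frame
    """
    outputs = []
    prev = None  # processed previous frame
    for frame, mask in zip(composite_frames, fg_masks):
        z_tb = invert(frame, t_b)          # inversion process, stopped at t_b
        out = generate(z_tb, t_b, prev)    # generation process, sees previous output
        # Background replacement (assumed form): keep generated values on the
        # foreground mask and restore the original values everywhere else.
        out = [o if m else f for o, f, m in zip(out, frame, mask)]
        outputs.append(out)
        prev = out
    return outputs

if __name__ == "__main__":
    identity_invert = lambda frame, t_b: list(frame)
    def toy_generate(z, t_b, prev):
        shift = 1 if prev is None else prev[0] + 1
        return [x + shift for x in z]
    print(compose_video([[0, 0], [10, 10]], [[1, 0], [1, 0]],
                        identity_invert, toy_generate))
```

The only state carried between iterations is the previous output, which is what makes the processing a cascade rather than a set of independent per-frame edits.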

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Comparison of our method with previous methods. Above: Training and inference process of previous methods [CO22022deep], which perform poorly when facing deep semantic disparities. Below: Our training-free pipeline, which achieves satisfactory results in both color and lighting adjustments and deep semantic transformation. More cases can be found at https://anonymous.4open.science/r/paper130.
  • Figure 1: Compositing results using the latent of different inversion steps as the initial point. The complete inversion process takes 20 steps. Two examples are shown, each showing one frame. The best results are marked with red boxes.
  • Figure 2: Our proposed training-free pipeline. We process the composite video $V^c$ frame-by-frame in a cascading manner, as shown in the orange box at the top of the figure. The yellow box illustrates our process for each frame. Specifically, we employ Stable Diffusion [sd2022high] to process frame $i$ in two processes: inversion and generation. During the inversion process, we invert $I^c_i$ for $t_b$ steps to obtain an initial point $z^c_{i, t_b}$ using Balanced Partial Inversion (BPI). Then, we start the generation process from this initial point. During the generation process, the processed previous frame $I^h_{i-1}$ affects the current frame through the Inter-Frame Augmented attention (IFA), shown in the blue box, which associates the frames with each other.
  • Figure 2: Compositing results with different operating ranges of IFA. The complete generation process takes 20 steps. In this case, $t_b=15$. The best results are marked with a red box.
  • Figure 3: Image reconstruction using the latent of different inversion steps as the initial point. The complete inversion process takes $T=20$ steps. Images reconstructed from initial points with fewer inversion steps retain more characteristics of the input.
  • ...and 3 more figures
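
The captions above say that the processed previous frame influences the current frame through IFA, and that IFA has an operating range within the generation steps. The sketch below illustrates one common way to build such cross-frame attention: appending the previous frame's keys and values to the current frame's self-attention. That construction is an assumption and is not confirmed by this summary as the authors' formulation; the names `inter_frame_attention` and `ifa_steps`, and the step numbering, are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention on lists of token vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def inter_frame_attention(q, k_cur, v_cur, k_prev, v_prev, step, ifa_steps):
    """Attend to current-frame tokens, plus previous-frame tokens inside ifa_steps.

    Assumed design: previous-frame keys/values are concatenated to the current
    ones only while `step` lies in the operating range `ifa_steps`.
    """
    if k_prev is not None and step in ifa_steps:
        return attention(q, k_cur + k_prev, v_cur + v_prev)
    return attention(q, k_cur, v_cur)

if __name__ == "__main__":
    q, k_cur, v_cur = [[1.0, 0.0]], [[1.0, 0.0]], [[1.0]]
    k_prev, v_prev = [[1.0, 0.0]], [[3.0]]
    print(inter_frame_attention(q, k_cur, v_cur, k_prev, v_prev, 10, range(5, 15)))
    print(inter_frame_attention(q, k_cur, v_cur, k_prev, v_prev, 2, range(5, 15)))
```

Inside the operating range the output blends values from both frames; outside it the call reduces to ordinary self-attention on the current frame.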