Training-Free Semantic Video Composition via Pre-trained Diffusion Model
Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song
TL;DR
This work tackles reference-guided video composition where the foreground and background differ semantically, beyond simple color or lighting mismatches, by introducing a training-free pipeline built on a pre-trained diffusion model (Stable Diffusion). Videos are processed frame by frame in a cascade, and each frame passes through two stages: inversion, then generation. Balanced Partial Inversion, a controllable inversion strategy that preserves the characteristics of the input foreground, initializes generation from a latent $z^c_{i,t_b}$ at an intermediate step $t_b$ (where $t_b \in (0,T)$ and $T=20$). During generation, Inter-Frame Augmented attention (IFA) propagates foreground continuity from the previous frame $I^h_{i-1}$, and background replacement preserves the reference scene. The method reports improved harmony and inter-frame coherence across both shallow and deep semantic disparities, and balances temporal consistency and semantic alignment better than the baselines. The approach is of practical use for video editing and creative synthesis, with potential extensions to multi-object composition and broader semantic domains.
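The frame-by-frame cascade can be read as a simple control loop. The sketch below is only an illustration of that control flow, not the authors' implementation: `invert_to` and `generate_from` are hypothetical callables standing in for the diffusion model's partial inversion and denoising passes, frames are toy nested lists, and doing the background replacement in pixel space with a binary foreground mask is an assumption.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[List[float]]  # toy grayscale frame: rows of pixel values
Mask = List[List[int]]     # 1 = foreground pixel, 0 = background pixel


def replace_background(generated: Frame, reference: Frame, mask: Mask) -> Frame:
    """Keep generated pixels inside the mask and reference pixels outside it."""
    return [
        [g if m else r for g, r, m in zip(g_row, r_row, m_row)]
        for g_row, r_row, m_row in zip(generated, reference, mask)
    ]


def compose_video(
    composites: Sequence[Frame],
    backgrounds: Sequence[Frame],
    masks: Sequence[Mask],
    invert_to: Callable[[Frame, int], Frame],
    generate_from: Callable[[Frame, int, Optional[Frame]], Frame],
    t_b: int = 10,  # illustrative default; the source only states 0 < t_b < T
    T: int = 20,
) -> List[Frame]:
    """Process frames in order, feeding each output to the next frame's generation."""
    if not 0 < t_b < T:
        raise ValueError("t_b must lie strictly between 0 and T")
    outputs: List[Frame] = []
    previous: Optional[Frame] = None  # plays the role of I^h_{i-1}; None for frame 0
    for composite, background, mask in zip(composites, backgrounds, masks):
        latent = invert_to(composite, t_b)                 # partial inversion up to step t_b
        generated = generate_from(latent, t_b, previous)   # generation conditioned on previous output
        harmonized = replace_background(generated, background, mask)
        outputs.append(harmonized)
        previous = harmonized
    return outputs
```

Stopping the inversion at an intermediate `t_b` rather than at `T` is what the summary describes as the balance point: the latent stays close enough to the input to keep the foreground recognizable, while leaving some denoising steps in which the model can modify it.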
Abstract
The video composition task aims to integrate specified foregrounds and backgrounds from different videos into a harmonious composite. Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments, such as domain gaps. Therefore, we propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge, which can process composite videos with broader semantic disparities. Specifically, we process the video frames in a cascading manner and handle each frame in two processes with the diffusion model. In the inversion process, we propose Balanced Partial Inversion to obtain generation initial points that balance reversibility and modifiability. Then, in the generation process, we further propose Inter-Frame Augmented attention to augment foreground continuity across frames. Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs, demonstrating efficacy in managing broader semantic disparities.
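The abstract does not spell out how Inter-Frame Augmented attention is computed. One common way to let a frame attend to its predecessor, shown here purely as an assumed illustration, is to append the previous frame's keys and values to the current frame's before the softmax. The function names and the toy list-of-vectors token format are hypothetical, and the actual method may differ, for example in which tokens it shares across frames.

```python
import math
from typing import List, Optional

Vectors = List[List[float]]  # one row per token


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Vectors, keys: Vectors, values: Vectors) -> Vectors:
    """Plain scaled dot-product attention."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    out: Vectors = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return out


def inter_frame_attention(
    q_cur: Vectors,
    k_cur: Vectors,
    v_cur: Vectors,
    k_prev: Optional[Vectors] = None,
    v_prev: Optional[Vectors] = None,
) -> Vectors:
    """Attend over current-frame tokens plus, if available, previous-frame tokens."""
    if k_prev is None or v_prev is None:
        return attention(q_cur, k_cur, v_cur)  # first frame: no predecessor
    # Assumption: augmentation is done by concatenating keys and values.
    return attention(q_cur, k_cur + k_prev, v_cur + v_prev)
```

With this form, each output token is a weighted average of values from both frames, so content from the previous frame can directly influence the current one without any extra training.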
