Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara; Yujia Chen; Ming-Hsuan Yang; James M. Rehg; Wen-Sheng Chu; Du Tran

TL;DR

The paper tackles controllable, affordance-aware video composition under data scarcity by introducing Split-then-Merge (StM), a data-driven framework that decomposes unlabeled videos into foreground and background layers and learns to recompose them. A Decomposer constructs a multi-layer dataset (StM-50K) using off-the-shelf models for captioning, motion segmentation, and inpainting. A Transformer-based Composer, built on CogVideoX-I2V, learns to merge the layers through a transformation-aware training pipeline and an identity-preservation loss that balances foreground fidelity with scene harmony. Empirical results, including automated metrics, human studies, and VLLM-based judgments, show that StM outperforms state-of-the-art baselines in motion consistency and affordance-aware integration, at the cost of weaker textual alignment. The work offers a scalable path to realistic, controllable video composition without manual annotations, with practical implications for content creation and AI-assisted video editing, and it identifies text-guided alignment and decomposition robustness as areas for improvement.
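The split-and-recompose idea can be illustrated with a toy sketch. This is not the paper's Decomposer or Composer: the summary says the Decomposer relies on off-the-shelf segmentation and inpainting models and the Composer is a learned Transformer, whereas the code below assumes the masks are already given, leaves the background hole unfilled, and merges with a naive per-pixel paste. The names `split_layers` and `merge_layers` and the grayscale list-of-rows frame format are hypothetical choices made for illustration.

```python
# Hypothetical toy example: a frame is a list of rows of grayscale floats,
# and a mask marks foreground pixels with 1 and background pixels with 0.

def split_layers(video, masks, fill=0.0):
    """Split each frame into a foreground layer and a background layer.

    Assumption: masks are provided. The background hole is filled with a
    constant here; a real pipeline would inpaint it instead.
    """
    foreground, background = [], []
    for frame, mask in zip(video, masks):
        foreground.append(
            [[p if m else fill for p, m in zip(row, mrow)]
             for row, mrow in zip(frame, mask)]
        )
        background.append(
            [[fill if m else p for p, m in zip(row, mrow)]
             for row, mrow in zip(frame, mask)]
        )
    return foreground, background


def merge_layers(foreground, background, masks):
    """Naive paste: use the foreground pixel wherever the mask is set."""
    return [
        [[f if m else b for f, b, m in zip(frow, brow, mrow)]
         for frow, brow, mrow in zip(fframe, bframe, mframe)]
        for fframe, bframe, mframe in zip(foreground, background, masks)
    ]


# Two 2x2 frames with per-frame masks.
video = [[[0.1, 0.9], [0.2, 0.8]], [[0.3, 0.7], [0.4, 0.6]]]
masks = [[[0, 1], [0, 1]], [[0, 1], [0, 0]]]

fg, bg = split_layers(video, masks)
recomposed = merge_layers(fg, bg, masks)
```

Merging a video's own layers with its own masks reproduces the input exactly, which is why a plain paste is too easy a target. The summary's transformation-aware training pipeline presumably exists to make the recomposition task non-trivial, though the details are not given here.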

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art (SoTA) methods in both quantitative benchmarks and human and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
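One simple way to picture a loss that favors foreground fidelity is a mask-weighted reconstruction error. The sketch below is an assumption made for illustration, not the paper's identity-preservation loss, whose exact form is not stated in this abstract. The function name `identity_weighted_mse`, the `fg_weight` parameter, and the flat pixel lists are all hypothetical.

```python
def identity_weighted_mse(pred, target, mask, fg_weight=2.0):
    """Weighted mean squared error over flat pixel lists.

    Foreground pixels (mask == 1) get weight fg_weight and background
    pixels get weight 1.0, so foreground errors are penalized more.
    This is an illustrative stand-in, not the paper's formulation.
    """
    total, weight_sum = 0.0, 0.0
    for p, t, m in zip(pred, target, mask):
        w = fg_weight if m else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


pred = [0.5, 0.5, 0.0, 1.0]
target = [0.0, 0.5, 0.0, 0.0]
mask = [1, 0, 0, 1]  # both errors fall on foreground pixels

weighted = identity_weighted_mse(pred, target, mask, fg_weight=2.0)
uniform = identity_weighted_mse(pred, target, mask, fg_weight=1.0)
```

With `fg_weight=1.0` the function reduces to an ordinary mean squared error. Raising the weight increases the penalty for errors on the subject relative to the scene, which is one plausible reading of trading off foreground fidelity against overall harmony.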
