Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

TL;DR

The paper tackles the challenge of controllable, affordance-aware video composition under data scarcity by introducing Split-then-Merge (StM), a data-driven framework that decomposes unlabeled videos into foreground and background layers and learns to recompose them. A Decomposer constructs a multi-layer dataset (StM-50K) using off-the-shelf models for captioning, motion segmentation, and inpainting, while a Transformer-based Composer (based on CogVideoX-I2V) learns to merge layers with a transformation-aware training pipeline and an identity-preservation loss that balances foreground fidelity with scene harmony. Empirical results, including automated metrics, human studies, and VLLM-based judgments, show StM outperforms state-of-the-art baselines in motion consistency and affordance-aware integration, albeit with a trade-off in textual alignment. The work delivers a scalable pathway to realistic, controllable video composition without manual annotations, with practical implications for content creation and AI-assisted video editing, and highlights avenues for improving text-guided alignment and decomposition robustness.

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods both in quantitative benchmarks and in human and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
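
As a rough illustration of why splitting an unlabeled video yields free supervision, the toy sketch below separates a video into a foreground layer and a background layer with a binary mask, then pastes them back together to recover the original. This is not the paper's Decomposer: the mask here is hand-made rather than produced by a motion segmentation model, the background holes are left unfilled rather than inpainted, and the function names `split_layers` and `paste` are hypothetical.

```python
import numpy as np

def split_layers(video, mask):
    """Split a video into a foreground layer and a background layer with holes.

    video: float array of shape (T, H, W, C)
    mask:  bool array of shape (T, H, W); True marks foreground pixels
    """
    m = mask[..., None].astype(video.dtype)   # add a channel axis for broadcasting
    foreground = video * m                    # pixels outside the subject are zeroed
    background_holes = video * (1.0 - m)      # subject removed; a real pipeline would inpaint here
    return foreground, background_holes

def paste(foreground, background, mask):
    """Naive mask-based recomposition (assumed here only to check the split)."""
    m = mask[..., None].astype(foreground.dtype)
    return foreground * m + background * (1.0 - m)

# Toy data: 4 frames of 8x8 RGB noise with a fixed rectangular "subject".
rng = np.random.default_rng(0)
video = rng.random((4, 8, 8, 3))
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:5, 3:6] = True

fg, bg = split_layers(video, mask)
recomposed = paste(fg, bg, mask)
```

Because the two layers come from the same clip, the original video serves as the reconstruction target for a model that learns to merge them, which is the self-composition idea the abstract describes.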

Paper Structure

This paper contains 28 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Video Composition via Split-then-Merge. (a) Training: The Decomposer splits an unlabeled video into foreground and background layers and generates a caption, while the Composer learns to merge them for reconstruction. (b) Inference: The Composer integrates a foreground video into novel background videos and ensures affordance-aware placement (e.g., a pig on a forest road, NYC walkway, or lunar surface) with realistic harmonization (motion, lighting, shadows). Best viewed in color.
  • Figure 2: Video Composition. Given input foreground and background videos, image-based methods (a)-(b) use only the first frame, while (c)-(e) take full video inputs. (a) Object insertion [wu2025qwenimagetechnicalreport] followed by Image-to-Video (I2V) and (b) end-to-end I2V composition SkyReels [ji2025layerflow] fail to retain motion because they lack access to the full video. (c) Manual copy-paste preserves motion but violates affordance (swan placed on ground). (d) Naive generative composition yields appearance and motion drift (e.g., black swan turns white). (e) Our method preserves identity and motion, and achieves affordance-aware placement with realistic blending (swan placed in water with waves and shadows).
  • Figure 3: StM Decomposer. The StM Decomposer integrates off-the-shelf models to split unlabeled videos. First, motion segmentation generates a foreground mask, which is used to extract the foreground layer. An inpainting model then fills the "holes" in the masked background video. Finally, a video captioning model generates a descriptive text caption for the original video.
  • Figure 4: StM Composer Training. The Composer is trained to reconstruct a ground-truth video latent from foreground, background, and text inputs. First, the foreground video is augmented, and all video inputs (augmented foreground, background, ground truth) are encoded into latents by a frozen Space-Time (ST) VAE. The text prompt is encoded as $Z_{text}$. A noisy ground-truth latent (blue) is fused with background (green) and augmented foreground (yellow) latents via a projection layer to produce the visual representation $Z_{vision}$. A Diffusion Transformer then processes $Z_{vision}$ and $Z_{text}$ to predict a composed latent (red). The identity-preservation loss comprises two weighted sub-losses comparing the prediction (red) against the ground truth (blue) using foreground- and background-aware masking.
  • Figure 5: Qualitative comparison. Our method (StM) uniquely preserves complex dynamics and achieves affordance-aware harmony where baselines fail. (Left) Only StM maintains both the rapid background camera motion and the realistic foreground running motion. (Center) StM demonstrates affordance by adapting the boat's orientation and the height of the waves. (Right) StM accurately preserves the car's semantic action, road alignment, and lighting consistency, unlike alternative methods.
  • ...and 3 more figures
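
The Figure 4 caption describes the identity-preservation loss as two weighted sub-losses that compare the prediction with the ground truth under foreground- and background-aware masking. A minimal sketch of that structure follows, with several assumptions that the caption does not confirm: each sub-loss is a masked mean squared error, each is normalized by its own mask area, and the weights `w_fg` and `w_bg` are hypothetical parameters. Plain arrays stand in for the VAE latents.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over the elements where mask is 1 (assumed form)."""
    mask = np.broadcast_to(mask, pred.shape).astype(pred.dtype)
    denom = max(mask.sum(), 1.0)              # avoid division by zero for empty masks
    return float(((pred - target) ** 2 * mask).sum() / denom)

def identity_preservation_loss(pred, target, fg_mask, w_fg=1.0, w_bg=1.0):
    """Weighted sum of a foreground-masked and a background-masked error term.

    pred, target: arrays of shape (T, H, W, C)
    fg_mask:      array of shape (T, H, W, 1); 1 inside the foreground, 0 elsewhere
    w_fg, w_bg:   hypothetical weights trading foreground fidelity against scene harmony
    """
    fg_term = masked_mse(pred, target, fg_mask)
    bg_term = masked_mse(pred, target, 1.0 - fg_mask)
    return w_fg * fg_term + w_bg * bg_term

# Toy check: the prediction is off by 1.0 inside the foreground and exact elsewhere.
target = np.zeros((2, 4, 4, 3))
pred = target.copy()
pred[:, :2] += 1.0
fg_mask = np.zeros((2, 4, 4, 1))
fg_mask[:, :2] = 1.0

loss = identity_preservation_loss(pred, target, fg_mask, w_fg=2.0, w_bg=1.0)
```

Normalizing each term by its own mask area keeps a small foreground from being drowned out by a large background, which is one plausible reason to separate the two terms; the paper's actual weighting and normalization may differ.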